
QUESTION

Deliverables:

1. Complete code in a compressed archive (zip, tgz, etc.).

2. A README file with a complete description of the software used, along with installation, compilation, and execution instructions.

3. A document with the results for the questions below.

Task:

Develop a specialized Web crawler.

Test your crawler only on the data in:

Make sure that your crawler does not leave this directory! Yes, there is a robots.txt file that must be obeyed; note that it is in a non-standard location.
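The two constraints above (stay under the root directory, and obey robots.txt) can be combined into a single "is this URL allowed?" check. A minimal sketch using Python's standard `urllib.robotparser` follows; the base URL and the robots.txt rules here are placeholders, since the assignment's actual URL is not reproduced in this posting. In the real crawler you would fetch the robots.txt from its non-standard location and feed its lines to `parse()`.

```python
from urllib import robotparser

BASE_URL = "http://example.com/data/"  # hypothetical crawl root; substitute the assignment's URL

rp = robotparser.RobotFileParser()
# For illustration only: parse sample rules inline instead of fetching them.
# In the assignment, download the robots.txt from its non-standard location
# and pass its lines to rp.parse().
rp.parse("User-agent: *\nDisallow: /data/private/".splitlines())

def allowed(url: str) -> bool:
    """A URL is crawlable only if it stays under the root AND robots.txt permits it."""
    return url.startswith(BASE_URL) and rp.can_fetch("*", url)
```

Checking the prefix yourself matters because robots.txt alone says nothing about staying inside the assigned directory.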

The required input to your program is N, the limit on the number of pages to retrieve, and a list of stop words (of your choosing) to exclude.

Perform case insensitive matching.
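One way to satisfy both requirements is to lowercase everything before matching, so the stop-word filter and the word counts are both case-insensitive. A small sketch (the tokenizer regex `[a-z]+` is an assumption; the assignment does not specify how words are delimited):

```python
import re

def count_words(text, stop_words):
    """Lowercase the text, tokenize it, and count words not in the stop list.

    Both the text and the stop words are lowercased, so matching is
    case-insensitive throughout.
    """
    stop = {w.lower() for w in stop_words}
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):  # assumed tokenization rule
        if word not in stop:
            counts[word] = counts.get(word, 0) + 1
    return counts
```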

You can assume that there are no errors in the input. Your code should, however, be robust to errors in the Web pages you are crawling: if an error is encountered, it is acceptable simply to skip the page on which it occurred.

Efficiency: Don't be ridiculously inefficient. There's no need to deliver turbo-charged algorithms or implementations. You don't need to worry about memory constraints; if your program runs out of space and dies on encountering a large file, that's OK. You do not have to use multiple threads; sequential downloading is OK.
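A simple sequential breadth-first crawl meets all of the requirements above: it stops after N successfully retrieved pages and skips any page whose fetch fails. The sketch below takes the fetch function as a parameter so it can be tested without a network; the regex-based `extract_links` helper is a deliberately crude assumption (a real crawler might use an HTML parser instead).

```python
import re
from collections import deque
from urllib.parse import urljoin

def extract_links(html, base_url):
    """Very small href extractor; an HTML parser would be more robust."""
    return [urljoin(base_url, m) for m in re.findall(r'href="([^"]+)"', html)]

def crawl(seed, n, fetch):
    """Sequential BFS crawl: stop after n pages, skipping pages that raise errors.

    fetch(url) -> str is supplied by the caller, e.g. a thin wrapper around
    urllib.request.urlopen(url).read().decode().
    """
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < n:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # the assignment permits skipping pages that fail
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because downloading is sequential and the frontier is a plain deque, this stays well within the "no turbo-charged algorithms" expectation.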
