
QUESTION

Crawler4j

Use Crawler4j (Java).

Review, fix, and run the crawler.

Add code for the additional requirements.

Make sure your crawler does the following.

Test your crawler only on the data in:

http://lyle.smu.edu/~fmoore

Make sure that your crawler is not allowed to get out of this directory!!! Yes, there is a robots.txt file that must be used. Note that it is in a non-standard location.
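For reference, here is one way these restrictions might be enforced. This is only a sketch: it assumes crawler4j 4.x, a crawler subclass named MooreCrawler, and that the non-standard robots.txt lives at http://lyle.smu.edu/~fmoore/robots.txt. Because crawler4j only looks for robots.txt at the root of the host, the file is fetched and parsed by hand here, and every Disallow line is applied regardless of user-agent.

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class MooreCrawler extends WebCrawler {

        private static final String BASE = "http://lyle.smu.edu/~fmoore";

        // Disallow rules read from the non-standard robots.txt location.
        private static final List<String> disallowed = new ArrayList<>();

        static {
            // crawler4j only checks /robots.txt at the host root, so the file at
            // http://lyle.smu.edu/~fmoore/robots.txt is fetched and parsed by hand.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(BASE + "/robots.txt").openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim().toLowerCase();
                    if (line.startsWith("disallow:")) {
                        String rule = line.substring("disallow:".length()).trim();
                        if (!rule.isEmpty()) {
                            disallowed.add(rule);
                        }
                    }
                }
            } catch (Exception e) {
                // If robots.txt cannot be read, no extra Disallow rules are applied.
            }
        }

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            // Never leave the ~fmoore directory.
            if (!href.startsWith(BASE)) {
                return false;
            }
            // Honor the Disallow rules. They are applied as simple substring checks,
            // which is conservative (it may exclude slightly more than required).
            for (String rule : disallowed) {
                if (href.contains(rule)) {
                    return false;
                }
            }
            return true;
        }
    }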

The required inputs to your program are N, the limit on the number of pages to retrieve, and a list of stop words (of your choosing) to exclude.

Perform case-insensitive matching.
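As a starting point, the page limit and stop-word handling might be wired up roughly as follows. This is a sketch assuming crawler4j 4.x; the class names Controller and MooreCrawler, the crawl-data storage folder, and the one-word-per-line stop-word file format are all assumptions, not part of the assignment. The stop words are lower-cased once so that all later matching can be done on lower-cased tokens, which gives the required case-insensitive behavior.

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class Controller {
        public static void main(String[] args) throws Exception {
            int n = Integer.parseInt(args[0]);      // N, the limit on pages to retrieve
            String stopWordFile = args[1];          // e.g. stopwords.txt, one word per line

            // Lower-case the stop words once; all later comparisons use lower-cased tokens.
            Set<String> stopWords = new HashSet<>();
            for (String w : Files.readAllLines(Paths.get(stopWordFile))) {
                stopWords.add(w.trim().toLowerCase());
            }

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("crawl-data");
            config.setMaxPagesToFetch(n);           // stop after N pages

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://lyle.smu.edu/~fmoore/");
            controller.start(MooreCrawler.class, 1);
        }
    }

Because crawler4j creates the crawler objects itself, the stop-word set is typically handed to MooreCrawler through a static field rather than a constructor argument.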

You can assume that there are no errors in the input. Your code should be robust to errors in the web pages it encounters; if an error is encountered, it is acceptable simply to skip that page.
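Inside the crawler subclass, that usually amounts to wrapping the per-page work in a try/catch. A small sketch, where processPage is a helper whose body is sketched after the numbered list below:

    @Override
    public void visit(Page page) {
        try {
            processPage(page);   // helper doing the real per-page work (see item 2 below)
        } catch (Exception e) {
            // A malformed page is simply skipped, as allowed above.
            System.err.println("Skipping " + page.getWebURL().getURL() + ": " + e.getMessage());
        }
    }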

1.      Identify the key properties of a web crawler. Describe in detail how each of these properties is implemented in your code.  

2.      Use your crawler to list the URLs of all pages in the test data and report all out-going links found in the test data. Also display the contents of the <TITLE> tag of each page. [10 points] (A sketch follows this list.)

3.      Implement duplicate detection, and report if any URLs refer to already-seen content. (One hashing-based approach is sketched below.)

4.      Use your crawler to list all broken links within the test data. (See the status-code sketch below.)

5.      How many graphic files are included in the test data? (An extension-matching sketch is given below.)

6.      Have your crawler save the words from each page of type .txt, .htm, or .html. Make sure that you do not save HTML markup. Explain your definition of “word”. In this process, give each page a unique document ID, and implement stemming. [25 points] (A tokenizing/stemming sketch follows the list.)

7.      Report the 20 most common words with their document frequencies, and state whether you counted raw words or stemmed words. [15 points] (A document-frequency sketch follows.)
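For item 2, the URL, the <TITLE> contents, and the out-going links are all available through crawler4j's HtmlParseData. A sketch of the processPage helper mentioned earlier (imports of edu.uci.ics.crawler4j.parser.HtmlParseData and edu.uci.ics.crawler4j.url.WebURL are needed):

    private void processPage(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("Visited: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();

            // Contents of the <TITLE> tag.
            System.out.println("  Title: " + html.getTitle());

            // All out-going links found on this page.
            for (WebURL link : html.getOutgoingUrls()) {
                System.out.println("  Out-going link: " + link.getURL());
            }
        }
    }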
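For item 3, one simple approach (an assumption, not prescribed by the assignment) is to hash each page's extracted text and remember the first URL seen for each hash; any later URL whose text produces the same hash refers to already-seen content. The helper below would be called with page.getWebURL().getURL() and the page's extracted text, and a non-null return value names the earlier duplicate.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    /** Detects pages whose text content has already been seen under another URL. */
    public class DuplicateDetector {
        // Maps a SHA-256 hash of page text to the first URL seen with that content.
        private final Map<String, String> seen = new HashMap<>();

        /** Returns the URL of the earlier identical page, or null if this content is new. */
        public synchronized String firstUrlWithSameContent(String url, String pageText) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(pageText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return seen.putIfAbsent(hex.toString(), url);
        }
    }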
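For item 4, crawler4j's WebCrawler (4.x API) reports the HTTP status of each fetch through handlePageStatusCode, so broken links inside the test data can be collected by recording every URL that comes back with a 4xx or 5xx status. The snippet below goes inside MooreCrawler and needs java.util.ArrayList and java.util.List imports. Out-going links that point outside the ~fmoore directory are never fetched by this crawler, so they would need a separate request if they are to be checked as well.

    // URLs within the test data that came back with an HTTP error status.
    private static final List<String> brokenLinks = new ArrayList<>();

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        // 4xx and 5xx responses are reported as broken links.
        if (statusCode >= 400) {
            brokenLinks.add(webUrl.getURL() + " (" + statusCode + " " + statusDescription + ")");
        }
    }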
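For item 5, one simple heuristic (an assumption about what counts as a graphic file) is to match common image extensions against every URL the crawler sees, both visited pages and out-going links; the size of the resulting set is the answer. The snippet goes inside MooreCrawler and needs java.util.HashSet, java.util.Set, and java.util.regex.Pattern.

    // Distinct image URLs seen anywhere in the test data.
    private static final Set<String> graphicUrls = new HashSet<>();

    private static final Pattern IMAGE_EXT =
            Pattern.compile(".*\\.(gif|jpe?g|png|bmp|ico|svg)$");

    // Call this on every visited URL and on every out-going link.
    static void recordIfGraphic(String url) {
        if (IMAGE_EXT.matcher(url.toLowerCase()).matches()) {
            graphicUrls.add(url.toLowerCase());
        }
    }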
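For item 6, HtmlParseData.getText() already returns the page text with the HTML markup stripped. In the sketch below, a "word" is a maximal run of letters after lower-casing (that definition, the tokenizer, and the toy suffix-stripping stemmer are all assumptions; a real Porter stemmer library could be dropped in instead), stop words are removed, and each saved page receives a unique document ID from an incrementing counter.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicInteger;

    public class TextStore {
        private static final AtomicInteger nextDocId = new AtomicInteger(1);

        /** A "word" here is a maximal run of ASCII letters, lower-cased (an assumption). */
        public static List<String> tokenize(String text, Set<String> stopWords) {
            List<String> words = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    words.add(stem(token));
                }
            }
            return words;
        }

        /** Placeholder stemmer: strips a few common suffixes. A Porter stemmer could replace this. */
        static String stem(String w) {
            if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
            if (w.endsWith("es")  && w.length() > 4) return w.substring(0, w.length() - 2);
            if (w.endsWith("s")   && w.length() > 3) return w.substring(0, w.length() - 1);
            return w;
        }

        /** Assigns each saved page (.txt, .htm, .html) a unique document ID. */
        public static int newDocId() {
            return nextDocId.getAndIncrement();
        }
    }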
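For item 7, document frequency means the number of documents a term appears in, counted at most once per document. The sketch below works on whatever token stream was saved in item 6, so whether the report lists raw words or stemmed words depends only on what is fed in; state that choice in the write-up.

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class DocFrequency {
        // term (word or stem) -> number of documents it appears in
        private final Map<String, Integer> df = new HashMap<>();

        /** Call once per document with that document's token list. */
        public void addDocument(List<String> words) {
            Set<String> distinct = new HashSet<>(words);   // count each term at most once per document
            for (String w : distinct) {
                df.merge(w, 1, Integer::sum);
            }
        }

        /** Prints the 20 terms that occur in the most documents. */
        public void printTop20() {
            df.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
              .limit(20)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
        }
    }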
