Introduction to Information Retrieval

0. Approximately how many hours did you spend on this homework?

1. (5/5 points) Locate the article “Reimagining Search” by Alex Wright (CACM June 2016).

a. What type of study was conducted by researchers at Microsoft?

b. How many people participated in the Google Daily Information Needs study?

2. (15/15 points) Compute the hit list for ((paris AND NOT france) OR lear). Show each step.
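
The postings lists for this query are not reproduced here, so the document IDs below are hypothetical placeholders. Under that assumption, a minimal sketch of the step-by-step evaluation with Python sets might look like this:

```python
# Hypothetical postings lists (document IDs). These are assumptions for
# illustration only; substitute the lists from the course material.
postings = {
    "paris": {1, 2, 5, 7, 12},
    "france": {2, 4, 7, 15},
    "lear": {5, 12, 15},
}

def evaluate(postings):
    """Evaluate ((paris AND NOT france) OR lear) and return both steps."""
    # Step 1: paris AND NOT france is a set difference.
    step1 = postings["paris"] - postings["france"]
    # Step 2: (step 1) OR lear is a set union.
    step2 = step1 | postings["lear"]
    return sorted(step1), sorted(step2)

step1, hits = evaluate(postings)
print("paris AND NOT france:", step1)
print("((paris AND NOT france) OR lear):", hits)
```

Evaluating the AND NOT first keeps the intermediate result no larger than the postings list for paris before the union is taken.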

3. (15/0 points) List alphabetically all case-normalized tokens and terms in the text below:

In late 1991, The Dallas Morning News became the lone major newspaper in the Dallas market when the Dallas Times-Herald was closed. The closure was after several years of circulation wars between the two papers, especially over the then-burgeoning classified advertising market.
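
One possible way to check a hand-built answer is sketched below. It assumes that tokens are every case-folded occurrence of a word, that terms are the distinct tokens, that punctuation is dropped, and that hyphenated words such as Times-Herald stay whole. These are choices made for this sketch, not the required tokenization, and the helper name `tokens_and_terms` is hypothetical. Only the first sentence of the passage is used.

```python
import re

# First sentence of the passage above.
TEXT = ("In late 1991, The Dallas Morning News became the lone major "
        "newspaper in the Dallas market when the Dallas Times-Herald "
        "was closed.")

# Assumption: a token is a run of letters or digits, optionally joined
# by internal hyphens (so "Times-Herald" is a single token).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokens_and_terms(text):
    """Return (sorted case-folded tokens, sorted distinct terms)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return sorted(tokens), sorted(set(tokens))

tokens, terms = tokens_and_terms(TEXT)
print(len(tokens), "tokens:", tokens)
print(len(terms), "terms:", terms)
```

Changing the hyphen rule (splitting Times-Herald into two tokens) changes both lists, which is why the tokenization choice should be stated with the answer.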

4. (20/20 points) Use an implementation of the Porter Stemmer from http://tartarus.org/martin/PorterStemmer/.  Run your selected implementation on the text below:

The New York Times (NYT) is an American daily newspaper, founded and continuously published in New York City since 1851. It has won 108 Pulitzer Prizes, more than any other news organization. Its website, nytimes.com, is America’s most popular newspaper site, receiving more than 30 million unique visitors per month.

Organ transplantation is the moving of an organ from one body to another or from a donor site to another location on the patient’s own body, for the purpose of replacing the recipient’s damaged or absent organ. The emerging field of regenerative medicine is allowing scientists and engineers to create organs to be re-grown from the patient’s own cells (stem cells, or cells extracted from the failing organs). Organs and/or tissues that are transplanted within the same person’s body are called autografts. Transplants that are recently performed between two subjects of the same species are called allografts. Allografts can either be from a living or cadaveric source.

a. Take a look at the implementation. Indicate which rule(s) are used (spell out the lines of code that implement the rule(s), and what they do) to transform

• website into websit
• engineers into engine
• continuously into continu

b. Find two words from the above text that are stemmed into the same sequence of characters even though (theoretically) they should not.

5. (15/15 points) Consider the phrase “pizza with pepperoni”. 

a. One way to implement this query is with two biword phrases, “pizza with” and “with pepperoni”. Consider just the first page of results from Bing and from Yahoo. How many results (not ads) are on each page? How many are relevant? How many results are common to both search engines?

b. Another way to implement this query is as the exact phrase “pizza with pepperoni”. Consider just the first page of results from Bing and from Yahoo. How many results (not ads) are on each page? How many are relevant? How many results are common to both search engines?

c. In your opinion, which search engine produced the ‘better’ results?  Why?

6. (10/10 points) Web search engines A and B each crawl a random subset of the same size of the Web. Some of the crawled pages are duplicates, that is, identical content at different URLs. Assume that duplicates are distributed uniformly amongst the pages crawled by A and B. Assume a duplicate is a page that has exactly two copies. A indexes pages without duplicate elimination, whereas B indexes only one copy of each duplicate page. If 45% of A’s indexed URLs are present in B’s index, and 50% of B’s indexed URLs are present in A’s index, what fraction of the Web consists of pages that do not have a duplicate?

7. (10/10 points) Why is it better to partition hosts (rather than individual URLs) between the nodes of a distributed crawl system?
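
To make the host-based scheme concrete, here is a minimal sketch of assigning URLs to crawler nodes by hashing the hostname. The function name `node_for_url`, the use of MD5, and the node count are assumptions made for illustration and do not describe any particular crawler.

```python
import hashlib
from urllib.parse import urlsplit

def node_for_url(url, num_nodes):
    """Map a URL to a node index using only its hostname."""
    # urlsplit(...).hostname is lower-cased, so Example.com and
    # example.com land on the same node.
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # A stable hash (unlike the built-in hash()) gives the same
    # assignment across processes and machines.
    return int.from_bytes(digest[:8], "big") % num_nodes

urls = [
    "http://example.com/a",
    "http://example.com/b?x=1",
    "http://Example.com/c",
]
nodes = [node_for_url(u, 4) for u in urls]
print(nodes)
```

All three URLs map to the same node because the path and query string are ignored, so a single node sees every request destined for that host.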

8. (10/10 points) Consider the token “lke”. Since this does not match a word in the dictionary, it must be misspelled. What do you consider to be the correct word? Explain your reasoning.
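
One common way to reason about a correction is edit distance. The sketch below computes Levenshtein distance from “lke” to a small word list; the list is a hypothetical stand-in for a real dictionary, and the function is a textbook dynamic-programming version rather than any specific library's implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical dictionary, for illustration only.
dictionary = ["like", "lake", "luke", "lie", "bike", "look"]

distances = {w: levenshtein("lke", w) for w in dictionary}
candidates = sorted(w for w, d in distances.items() if d == 1)
print(distances)
print("distance-1 candidates:", candidates)
```

With this word list, several candidates tie at distance 1, so edit distance alone does not settle the answer; word frequency or the surrounding context would be needed to break the tie.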

Graduate students: 

9. (0/15 points) Telephone numbers can be expressed in various formats such as +1 (800) 123-4567, (800) 123-4567, and 123-4567.

Write and implement (or if found online, cite source) a regular expression that can detect telephone numbers. Does your implementation handle any formats other than the examples provided above? Provide the source code and the results of your testing.
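
As a starting point, the sketch below uses Python's `re` module to accept exactly the three example formats listed above. The names `PHONE_RE`, `is_phone_number`, and `find_phone_numbers` are hypothetical, and the pattern deliberately does not cover other formats (dots, extensions, other country codes), which would need to be added and tested separately.

```python
import re

# Matches "+1 (800) 123-4567", "(800) 123-4567", and "123-4567" only.
PHONE_RE = re.compile(r"""
    (?:
        (?:\+1[ ])?     # optional country code followed by a space
        \(\d{3}\)[ ]    # area code in parentheses followed by a space
    )?
    \d{3}-\d{4}         # seven-digit local number with a hyphen
""", re.VERBOSE)

def is_phone_number(s):
    """True if the whole string is one of the supported formats."""
    return PHONE_RE.fullmatch(s) is not None

def find_phone_numbers(text):
    """Return every supported phone number found in the text."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]

samples = ["+1 (800) 123-4567", "(800) 123-4567", "123-4567",
           "800-123-4567", "1234567"]
results = {s: is_phone_number(s) for s in samples}
found = find_phone_numbers("Call +1 (800) 123-4567 or 123-4567.")
print(results)
print(found)
```

The last two samples are rejected on purpose: they show formats that this pattern does not handle. Note that `find_phone_numbers` does no boundary checking, so it can match inside a longer digit string.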

(END)
