Computer Science (Information Retrieval)



  1. (12 pts) Consider this dictionary: {“CAT”, “COUNT”, “DOG”, “DONKEY”, “ELEPHANT”}

    1. Complete this table assuming “dictionary as a string”:

       | Term-ID1 | Offset |
       |----------|--------|
       | 1        |        |
       | 2        |        |
       | 3        |        |
       | 4        |        |
       | 5        |        |

    1. Create a second dictionary consisting of each word reversed (e.g., CAT -> TAC). Show the dictionary as a string.

    1. Complete this table using your reversed dictionary string:

       | Term-ID2 | Offset |
       |----------|--------|
       | 1        |        |
       | 2        |        |
       | 3        |        |
       | 4        |        |
       | 5        |        |

    1. Using your two dictionaries, show how you can determine the words that satisfy the wildcard query C*T
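The dictionary-as-a-string layout and the two-dictionary wildcard lookup can be sketched in Python as a way to check a hand-worked answer. This is a minimal sketch, not a reference solution. It assumes 0-based character offsets, no length bytes between terms, and a reversed dictionary that is re-sorted alphabetically; the function names `pack`, `unpack`, and `wildcard` are hypothetical. If your course counts offsets from 1 or keeps the reversed terms in their original order, the numbers will differ accordingly.

```python
def pack(terms):
    """Concatenate terms into one string and record where each term starts."""
    offsets, pos = [], 0
    for term in terms:
        offsets.append(pos)          # 0-based start of this term (assumed convention)
        pos += len(term)
    return "".join(terms), offsets


def unpack(string, offsets):
    """Recover the terms: each one ends where the next one starts."""
    ends = offsets[1:] + [len(string)]
    return [string[start:end] for start, end in zip(offsets, ends)]


def wildcard(prefix, suffix, forward, backward):
    """Answer the query prefix*suffix using a forward and a reversed dictionary."""
    # Terms that start with the prefix, read from the forward dictionary.
    starts = {t for t in unpack(*forward) if t.startswith(prefix)}
    # Terms that end with the suffix: prefix search on the reversed
    # dictionary using the reversed suffix, then un-reverse each hit.
    ends = {t[::-1] for t in unpack(*backward) if t.startswith(suffix[::-1])}
    # Intersect, and require room for both the prefix and the suffix.
    return sorted(t for t in starts & ends if len(t) >= len(prefix) + len(suffix))


terms = ["CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"]
forward = pack(terms)
backward = pack(sorted(t[::-1] for t in terms))   # sorting is an assumption

print(forward)    # ('CATCOUNTDOGDONKEYELEPHANT', [0, 3, 8, 11, 17])
print(backward)   # ('GODTACTNAHPELETNUOCYEKNOD', [0, 3, 6, 14, 19])
print(wildcard("C", "T", forward, backward))   # ['CAT', 'COUNT']
```

The sketch scans the terms linearly for clarity. Because both dictionaries are sorted, a real index would locate the matching range with binary search instead.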

  1. (15 pts) Consider the following documents:

Doc1: the wood table

Doc2: they made the wood table

Doc3: the table is made of steel

Doc4: wood table or steel table

Using a shingle size of 2, compute the Jaccard coefficient of:

(Doc1, Doc2)

(Doc1, Doc3)

(Doc1, Doc4)

Based upon your results, Doc1 is most similar to ____?
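The shingle comparison can be sketched in Python as a check on the hand computation. This is a minimal sketch that assumes word-level shingles (each shingle is a pair of adjacent words), lowercased text, and the usual definition J(A, B) = |A ∩ B| / |A ∪ B|. The helper names `shingles` and `jaccard` are hypothetical, and character-level shingles would give different values.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles in the text (word-level is an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: size of intersection over size of union."""
    union = a | b
    if not union:
        return 0.0           # convention chosen here for two empty sets
    return len(a & b) / len(union)


docs = {
    "Doc1": "the wood table",
    "Doc2": "they made the wood table",
    "Doc3": "the table is made of steel",
    "Doc4": "wood table or steel table",
}

base = shingles(docs["Doc1"])
for name in ("Doc2", "Doc3", "Doc4"):
    print(name, jaccard(base, shingles(docs[name])))
# Doc2 0.5
# Doc3 0.0
# Doc4 0.2
```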


  1. (6 pts) Consider the following text:

This tree is just one of many older-growth trees in the forest. Forests in Texas, can be over 100 years-old before they are considered “old”. Trees can be over 200 years.

    1. What punctuation can be removed to determine terms?

    1. What stop words can be removed?

    1. Which tokens can be converted to lower case?
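The three normalization steps can be tried out with a short Python sketch. It is one possible pipeline under stated assumptions, not the expected answer: the stop list `STOP_WORDS` is a small hypothetical one, hyphens inside a word are kept, all other punctuation is dropped, and only sentence-initial tokens are lowercased so that a mid-sentence proper noun such as Texas keeps its capital.

```python
import re

# Hypothetical stop list for illustration; real systems use longer lists.
STOP_WORDS = {"this", "is", "of", "in", "the", "can", "be", "they", "are"}

# A token is a run of letters/digits, optionally joined by internal hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def terms(text):
    """Tokenize, lowercase sentence-initial tokens, and drop stop words."""
    out = []
    # Split after sentence-ending punctuation that is followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = TOKEN.findall(sentence)    # punctuation is discarded here
        if tokens:
            tokens[0] = tokens[0].lower()   # capital is only due to position
        out.extend(t for t in tokens if t.lower() not in STOP_WORDS)
    return out


text = (
    "This tree is just one of many older-growth trees in the forest. "
    "Forests in Texas, can be over 100 years-old before they are "
    "considered “old”. Trees can be over 200 years."
)
print(terms(text))
```

Lowercasing every token would be simpler, but it would also fold Texas into an ordinary word, which is the distinction the last question asks about.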