Computer Science (Information Retrieval)
(12 pts) Consider this dictionary: {“CAT”, “COUNT”, “DOG”, “DONKEY”, “ELEPHANT”}
Term-ID1 | Offset
1 |
2 |
3 |
4 |
5 |
Complete this table assuming “dictionary as a string”
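As a minimal sketch of the "dictionary as a string" idea, the code below concatenates the terms into one string and records where each term starts. It assumes 0-based character offsets and terms stored in the order given; a course that counts offsets from 1 would shift every value by one. The helper names `build_dictionary_string` and `term_at` are hypothetical, chosen only for illustration.

```python
def build_dictionary_string(terms):
    """Concatenate the terms and record each term's starting offset (0-based, an assumption)."""
    offsets = []
    position = 0
    for term in terms:
        offsets.append(position)
        position += len(term)
    return "".join(terms), offsets


def term_at(dict_string, offsets, term_id):
    """Recover a term from its 1-based term ID using only the string and the offsets."""
    start = offsets[term_id - 1]
    # The term ends where the next term starts, or at the end of the string.
    end = offsets[term_id] if term_id < len(offsets) else len(dict_string)
    return dict_string[start:end]


terms = ["CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"]
dict_string, offsets = build_dictionary_string(terms)
print(dict_string)
print(offsets)
print(term_at(dict_string, offsets, 4))
```

No term lengths are stored: each term's end is implied by the next term's offset, which is the space saving this layout is meant to provide.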
Create a second dictionary consisting of each word reversed (e.g., CAT -> TAC). Show the dictionary as a string.
Term-ID2 | Offset
1 |
2 |
3 |
4 |
5 |
Complete this table using your reversed dictionary string
Using your two dictionaries, show how you can determine the words that satisfy the wildcard query C*T
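One way to approach a query of the form prefix*suffix is to look up the prefix in the forward dictionary, look up the reversed suffix in the reversed dictionary, and intersect the two result sets. The sketch below assumes a single `*` in the query and uses a linear scan rather than a sorted index or B-tree, purely for brevity. The function name `wildcard_match` is hypothetical.

```python
def wildcard_match(terms, query):
    """Answer a single-'*' wildcard query using a forward and a reversed dictionary."""
    prefix, suffix = query.split("*")

    # Reversed dictionary: each word spelled backwards, e.g. CAT -> TAC.
    reversed_terms = [term[::-1] for term in terms]

    # Forward dictionary: terms that start with the prefix.
    prefix_hits = {term for term in terms if term.startswith(prefix)}

    # Reversed dictionary: entries that start with the reversed suffix,
    # mapped back to their original spelling.
    reversed_suffix = suffix[::-1]
    suffix_hits = {rev[::-1] for rev in reversed_terms if rev.startswith(reversed_suffix)}

    # A term must be long enough that prefix and suffix do not overlap.
    min_length = len(prefix) + len(suffix)
    return sorted(t for t in prefix_hits & suffix_hits if len(t) >= min_length)


terms = ["CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"]
matches = wildcard_match(terms, "C*T")
print(matches)
```

For C*T, the forward lookup on C yields CAT and COUNT, the reversed lookup on T yields TAC, TNUOC, and TNAHPELE (that is, CAT, COUNT, and ELEPHANT), and the intersection leaves CAT and COUNT.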
(15 pts) Consider the following documents:
Doc1: the wood table
Doc2: they made the wood table
Doc3: the table is made of steel
Doc4: wood table or steel table
Using a shingle size of 2, compute the Jaccard coefficient of each pair:
(Doc1, Doc2)
(Doc1, Doc3)
(Doc1, Doc4)
Based upon your results, Doc1 is most similar to ____?
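The computation can be sketched as follows, assuming that a shingle of size 2 means two consecutive words (word-level shingles) and that documents are split on whitespace. If the course intends character-level shingles, the sets and the resulting coefficients would differ. The function names are hypothetical.

```python
def shingles(text, size=2):
    """Return the set of word-level shingles (runs of `size` consecutive words)."""
    words = text.split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


docs = {
    "Doc1": "the wood table",
    "Doc2": "they made the wood table",
    "Doc3": "the table is made of steel",
    "Doc4": "wood table or steel table",
}
sets = {name: shingles(text) for name, text in docs.items()}

scores = {other: jaccard(sets["Doc1"], sets[other]) for other in ("Doc2", "Doc3", "Doc4")}
for other, score in scores.items():
    print(f"J(Doc1, {other}) = {score}")
```

Under these assumptions Doc1 has two shingles ("the wood", "wood table"). It shares both with Doc2 (union of 4), none with Doc3 (union of 7), and one with Doc4 (union of 5).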
(6 pts) Consider the following text:
This tree is just one of many older-growth trees in the forest. Forests in Texas, can be over 100 years-old before they are considered “old”. Trees can be over 200 years.
What punctuation can be removed to determine terms?
What stop words can be removed?
Which tokens can be converted to lower case?
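The three normalization steps can be sketched with the standard `re` module. This is one possible pipeline, not a definitive answer: the stop-word list below is an illustrative assumption (real systems use a fixed list), hyphens are treated as separators so that "older-growth" becomes two tokens, and every token is lowercased, which also folds the proper noun "Texas".

```python
import re

TEXT = (
    "This tree is just one of many older-growth trees in the forest. "
    "Forests in Texas, can be over 100 years-old before they are "
    "considered \u201cold\u201d. Trees can be over 200 years."
)

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {
    "this", "is", "just", "one", "of", "many", "in", "the",
    "can", "be", "over", "before", "they", "are",
}


def tokenize(text):
    """Replace every character that is not a word character or whitespace with a space, then split."""
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return cleaned.split()


def normalize(tokens, stop_words=STOP_WORDS):
    """Lowercase every token, then drop the stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stop_words]


tokens = tokenize(TEXT)
index_terms = normalize(tokens)
print(index_terms)
```

Lowercasing before stop-word removal lets "This" and "this" hit the same list entry. Whether to keep "Texas" capitalized or to keep "older-growth" as one term is a design choice the questions above ask the reader to consider.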