1. Create the term frequency matrix for the four document titles below:
In building this matrix you should retain every word (i.e., there is no stopword list). The indexed units should be individual words, all of which should be converted to lower case. You should not apply stemming or other morphological processes to the words when creating the index terms. Your term-by-document matrix should have one row for each unique term and one column for each document. Each cell should contain the number of times the term for that row occurs in the document for that column. You may leave cells whose value is zero blank, or you may place a zero there. Turn in this matrix.
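The construction described above can be sketched in a few lines of Python. This is only an illustration: the two titles are hypothetical (they are not the titles from this assignment), the function name `term_frequency_matrix` is made up for the example, and splitting on whitespace assumes the titles contain no punctuation.

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    # Lower-case every word; no stopword removal and no stemming.
    counts = [Counter(doc.lower().split()) for doc in docs]
    # One row per unique term, in alphabetical order.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical titles, chosen only to illustrate the layout.
docs = ["Information Retrieval Systems",
        "Retrieval of Information of Interest"]
terms, tf = term_frequency_matrix(docs)
for term, row in zip(terms, tf):
    print(f"{term:12s}", row)
```

Each printed row corresponds to one term and each position in the row to one document, which is the same layout the hand-built matrix should have.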
Next, use the term-by-document matrix that you just created to perform boolean retrieval.
2. Create a new binary matrix from the term frequency matrix you prepared in question 1: each cell should contain a 1 where the term frequency is not zero and a 0 where the term frequency is zero in the original matrix (blanks are not a good idea here because we will need to see the zeros). Turn in this matrix. Use this new matrix for answering the rest of the questions in part 1.
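As a rough sketch of this conversion, assuming the term frequency matrix is stored as a list of rows, the 0/1 matrix can be derived as follows. The counts shown are hypothetical and the name `to_binary` is made up for the example.

```python
def to_binary(tf):
    """Replace every nonzero count with 1 and keep zeros as explicit 0s."""
    return [[1 if count > 0 else 0 for count in row] for row in tf]

# Hypothetical term frequency rows (two terms, four documents).
tf = [[2, 0, 1, 0],
      [0, 0, 3, 1]]
binary = to_binary(tf)
for row in binary:
    print(row)
```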
3. Extract the rows for the terms "information" and "retrieval" from the matrix and write them down, one above the other. Each row should contain four binary digits (0 or 1). Then draw a horizontal line under them and compute the AND of the two rows. This should produce a single row in which a 1 appears for each document that contains both "information" and "retrieval". What are these documents? Show your work.
As a reminder, the truth table for (a AND b) is:
     \b 0 | 1
    a +-------
    0 | 0 | 0
    --+---+---
    1 | 0 | 1
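The row-wise AND can be sketched as below. The two rows are hypothetical values, not the rows from your own matrix, and the helper names `row_and` and `matching_docs` are made up for the example; documents are assumed to be numbered d1 to d4 from left to right.

```python
def row_and(a, b):
    """Element-wise AND of two equal-length rows of 0/1 values."""
    return [x & y for x, y in zip(a, b)]

def matching_docs(row):
    """Names of the documents whose position in the row holds a 1."""
    return [f"d{j + 1}" for j, bit in enumerate(row) if bit]

# Hypothetical binary rows for two terms over four documents.
information = [1, 0, 1, 1]
retrieval = [1, 1, 0, 1]

result = row_and(information, retrieval)
print("information", *information)
print("retrieval  ", *retrieval)
print("-" * 19)
print("AND        ", *result)
print(matching_docs(result))
```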
4. Perform the same computation for the following queries:
Refer to the lecture notes if you need the truth tables for the OR and NOT operators. Present your computation in any manner that you find clear.
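For reference, OR yields 1 when at least one input is 1, and NOT flips a single input. A minimal sketch of how these combine on binary rows, using hypothetical rows `alpha` and `beta` and made-up helper names rather than any of the assigned queries:

```python
def row_and(a, b):
    return [x & y for x, y in zip(a, b)]

def row_or(a, b):
    """1 wherever at least one of the two rows has a 1."""
    return [x | y for x, y in zip(a, b)]

def row_not(a):
    """Flip every bit: 0 becomes 1 and 1 becomes 0."""
    return [1 - x for x in a]

# Hypothetical binary rows for two terms over four documents.
alpha = [1, 0, 1, 0]
beta = [0, 0, 1, 1]

either = row_or(alpha, beta)                     # alpha OR beta
alpha_not_beta = row_and(alpha, row_not(beta))   # alpha AND (NOT beta)
print(either)
print(alpha_not_beta)
```

Compound queries are evaluated inside-out: compute the innermost operation first, then feed the resulting row into the next operator.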
5. Perform the same computation for the query:
(information XOR system)
where the truth table for (a XOR b) is:
     \b 0 | 1
    a +-------
    0 | 0 | 1
    --+---+---
    1 | 1 | 0
Show the computation in the same form that you presented the answer to question 3.
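A small sketch of the XOR computation, laid out in the same rows-over-a-line form. The two rows are hypothetical and were chosen so that the four columns cover all four cases of the truth table; they are not the rows from your matrix.

```python
def row_xor(a, b):
    """1 wherever exactly one of the two rows has a 1."""
    return [x ^ y for x, y in zip(a, b)]

# Hypothetical rows; columns cover (1,1), (1,0), (0,1), (0,0).
information = [1, 1, 0, 0]
system = [1, 0, 1, 0]

result = row_xor(information, system)
print("information", *information)
print("system     ", *system)
print("-" * 19)
print("XOR        ", *result)
```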
For the following questions, use the term-by-document matrix provided below to perform vector space retrieval (blanks indicate zeros):
        d1  d2  d3
       +---+---+---+
    t1 |   |   | 5 |
       +---+---+---+
    t2 | 4 | 1 | 3 |
       +---+---+---+
    t3 | 5 |   | 4 |
       +---+---+---+
    t4 | 6 | 3 | 3 |
       +---+---+---+
    t5 |   | 1 |   |
       +---+---+---+
    t6 | 3 |   | 7 |
       +---+---+---+
    t7 |   | 6 | 1 |
       +---+---+---+
    t8 | 2 |   |   |
       +---+---+---+
1. Build the w matrix in which each element is computed as TF * IDF, where TF is what is specified in each element above and the IDF for term i is computed as:
log (total_number_of_documents / number_of_documents_containing_term_i)
Use base 10 logarithms. Microsoft Windows systems have a scientific calculator built in (select "scientific" from the menu) if you need one. Turn in this matrix.
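If you prefer a script to a calculator, the weighting can be sketched as follows. The small matrix is hypothetical (it is not the matrix given above), the function name `tf_idf` is made up for the example, and a term that appears in no document is given an IDF of 0 here as an assumption to avoid dividing by zero.

```python
import math

def tf_idf(tf):
    """Return w where w[i][j] = tf[i][j] * log10(N / df_i)."""
    n_docs = len(tf[0])
    w = []
    for row in tf:
        # Document frequency: number of documents containing the term.
        df = sum(1 for count in row if count > 0)
        idf = math.log10(n_docs / df) if df else 0.0
        w.append([count * idf for count in row])
    return w

# Hypothetical 3-term by 3-document term frequency matrix.
tf = [[2, 0, 0],
      [1, 1, 0],
      [3, 1, 2]]
w = tf_idf(tf)
for row in w:
    print([round(x, 4) for x in row])
```

Note that a term occurring in every document gets an IDF of log(1) = 0, so its whole row of weights is zero regardless of its term frequencies.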
2. Build the w' matrix by applying cosine normalization to the w matrix that you just created. This is found by dividing each element in the matrix by the square root of the sum of the squares of every element in the same column as the element in question. A formula for this that might be clearer but which says exactly the same thing is provided in the notes. Turn in this matrix.
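The column-wise normalization described in words above can be sketched like this. The `w` values are hypothetical, the name `cosine_normalize` is made up for the example, and leaving an all-zero column as zeros is an assumption made to avoid dividing by zero.

```python
import math

def cosine_normalize(w):
    """Divide each element by the Euclidean length of its column."""
    n_terms, n_docs = len(w), len(w[0])
    # Length of each document column: sqrt of the sum of squared weights.
    norms = [math.sqrt(sum(w[i][j] ** 2 for i in range(n_terms)))
             for j in range(n_docs)]
    return [[w[i][j] / norms[j] if norms[j] else 0.0
             for j in range(n_docs)]
            for i in range(n_terms)]

# Hypothetical 2-term by 2-document weight matrix.
w = [[3.0, 0.0],
     [4.0, 2.0]]
w_norm = cosine_normalize(w)
for row in w_norm:
    print(row)
```

After normalization, the squares of the entries in each nonzero column sum to 1, which is a quick way to check a hand computation.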
3. Using the w' matrix computed in question 2, compute the rank order of the documents that would be found using the vector space method for the UNWEIGHTED query:
t2 t7
In your answer, give the similarity score that is computed for each document and give the ranked list that the system would return.
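A minimal sketch of the scoring step, under the assumption that an unweighted query is a vector with a 1 for each query term and that similarity is the inner product of that vector with each normalized document column. If your notes also normalize the query vector, every score is scaled by the same constant and the rank order is unchanged. The matrix, the query, and the name `rank_documents` are hypothetical.

```python
def rank_documents(w_norm, query_rows):
    """Score each document by summing its weights for the query terms,
    then return (document, score) pairs from best to worst."""
    n_docs = len(w_norm[0])
    scores = [sum(w_norm[i][j] for i in query_rows) for j in range(n_docs)]
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return [(f"d{j + 1}", scores[j]) for j in order]

# Hypothetical normalized matrix: rows are t1..t3, columns are d1..d3.
w_norm = [[0.6, 0.0, 0.0],
          [0.8, 1.0, 0.6],
          [0.0, 0.0, 0.8]]

# Hypothetical unweighted query "t2 t3" (row indices 1 and 2).
ranking = rank_documents(w_norm, query_rows=[1, 2])
for doc, score in ranking:
    print(doc, round(score, 4))
```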
4. Compare your results for questions 1 through 3 to those presented in class for the same matrix (the starting point was the left three columns of the matrix used in the class slides) and BRIEFLY explain why any answers you obtained (weights, similarity scores, or rank orderings) were different.
Please post your assignment on your Website.
Acknowledgement to Doug Oard (INST 734).