[Thanks to James Allen and Jimmy Lin]
Please Read Chapter 3 on Evaluation.
In the second part of this assignment, for one of the two topics
[YOUR CHOICE--make this clear], you will
There should be judgments from N different people for each document (Web page).
The first question you'll answer is: How often do judges agree on relevance?
There are four possibilities:
For your chosen topic, figure out how often each case happens (both in terms of counts and in terms of percentage). Turn this information in. Pick two cases where judgments about a particular topic are not uniform, and briefly speculate why this may be so. Turn this in.
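One way to get the counts and percentages is to tally, for each document, how many judges said R and how many said N. The sketch below is only an illustration: the function name `agreement_breakdown`, the document ids, and the three-judge example data are all hypothetical, and it assumes each judgment is recorded as the letter R or N.

```python
from collections import Counter

def agreement_breakdown(judgments):
    """Tally vote patterns across documents.

    judgments: dict mapping a document id to a list of 'R'/'N' labels,
    one label per judge (hypothetical input format).
    Returns {(num_R, num_N): (count, percentage)}.
    """
    patterns = Counter()
    for labels in judgments.values():
        patterns[(labels.count("R"), labels.count("N"))] += 1
    total = sum(patterns.values())
    return {p: (c, 100.0 * c / total) for p, c in patterns.items()}

# Made-up judgments from three judges, for illustration only.
example = {
    "doc1": ["R", "R", "R"],
    "doc2": ["R", "N", "R"],
    "doc3": ["N", "N", "N"],
    "doc4": ["N", "R", "R"],
}
breakdown = agreement_breakdown(example)
for (r, n), (count, pct) in sorted(breakdown.items(), reverse=True):
    print(f"{r} R / {n} N: {count} documents ({pct:.1f}%)")
```

Documents whose pattern is neither all R nor all N are the ones where the judges disagree.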
Adjudication is simply the process
of reconciling inconsistent judgments. Do this by simple majority voting. If
you have an equal split for a particular document then simply pick one randomly.
The result should be something like this:
G N Gardening for dummies
G R How to deal with wet soil conditions
...
Turn your adjudicated relevance judgments in.
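Majority voting with a random tie-break could be sketched as follows. The function name `adjudicate`, the row layout, and the individual judges' labels are assumptions made for illustration; only the two page titles and the output format come from the example above. Each document is assumed to have at least one judgment.

```python
import random
from collections import Counter

def adjudicate(labels, rng=random):
    """Return the majority label; break an equal split by random choice."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# (engine, title, judges' labels); the labels are invented.
rows = [
    ("G", "Gardening for dummies", ["N", "N", "R"]),
    ("G", "How to deal with wet soil conditions", ["R", "R", "N"]),
]
rng = random.Random(0)  # fixed seed so tie-breaks are repeatable
for engine, title, labels in rows:
    print(engine, adjudicate(labels, rng), title)
```

Seeding the generator is optional; it only makes the tie-breaks reproducible if the script is rerun.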
Now, evaluate Live and Google using the adjudicated relevance judgments you just created (for the topic you chose). Issue the query to Google and Live again, and examine the top 20 hits. Turn in the following information for both search systems:
In addition, answer the following questions:
3A. The following list of R’s and N’s represents relevant (R) and non-relevant (N) documents in a ranked list of 50 documents. The “top” of the ranked list is at the left, so the first entry is the most highly weighted document, the one that the system believes is most likely to be relevant. The list runs across the page to the right and then continues on the next line. This list shows 8 relevant documents. Assume that there are an additional two (2) relevant documents that were not retrieved by this system.
R R N N N R N N N N R N N N R N N N N R R N N N N N N N R N
N N N N N N N N N N N N N N N N N N N N
Based on that list, calculate the following measures:
3B. Now imagine another system retrieves the following ranked list for the same query.
R R R R N N N N N N N N N N N N N N N N N N N N N N N N N N N
N N N N N N N N N N N N N N N N N N N R
Repeat parts (3.A.1), (3.A.2) and (3.A.3) for the above ranked list. Compare the two ranked lists on the basis of these 3 metrics that you have computed—i.e., if you were given only these 3 numbers (Average Precision, Precision at 50% recall and Precision at 33% recall) what can you determine about the relative performance of the two systems in general?
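As a rough sketch of how these three numbers can be computed, the code below works on a short made-up ranking, not on either list above. It assumes the usual definition of average precision, in which relevant documents that were never retrieved contribute zero, and it assumes "precision at X% recall" means the uninterpolated precision at the first rank where recall reaches X%. Check Chapter 3 for the convention you are expected to use. The function names are hypothetical.

```python
def average_precision(ranking, total_relevant):
    """Sum precision at each relevant document's rank, then divide by
    the total number of relevant documents (retrieved or not)."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant

def precision_at_recall(ranking, total_relevant, recall_level):
    """Precision at the first rank where recall >= recall_level,
    or None if the ranking never reaches that recall."""
    hits = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            if hits / total_relevant >= recall_level:
                return hits / rank
    return None

# Toy ranking: 2 relevant retrieved, 4 relevant in total (made-up data).
toy = "R N R N N".split()
print(average_precision(toy, 4))
print(precision_at_recall(toy, 4, 0.50))
print(precision_at_recall(toy, 4, 0.75))
```

For the toy ranking, average precision is (1/1 + 2/3) / 4, and 75% recall is never reached because only two of the four relevant documents were retrieved.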
3C. Plot a recall/precision graph for the above two systems. Generate both
an uninterpolated and an interpolated graph (probably as two graphs to make
the four plots easier to see). What do the graphs tell you about the system
in A and the one in B?
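The points for both kinds of graph can be generated along these lines. This is a sketch on a made-up ranking, with hypothetical function names. It assumes the common interpolation rule (interpolated precision at recall level r is the highest precision observed at any recall at or above r) evaluated at the 11 standard recall levels 0.0, 0.1, ..., 1.0. Feed the resulting points to whatever plotting tool you prefer.

```python
def recall_precision_points(ranking, total_relevant):
    """Uninterpolated (recall, precision) pairs, one per relevant document."""
    points = []
    hits = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

def interpolate(points, levels=None):
    """Interpolated precision at each recall level: the maximum precision
    at any recall >= that level, or 0.0 if that recall is never reached."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append((level, max(candidates) if candidates else 0.0))
    return result

# Toy ranking: 2 relevant retrieved, 4 relevant in total (made-up data).
toy = "R N R N N".split()
points = recall_precision_points(toy, 4)
interp = interpolate(points)
for level, precision in interp:
    print(f"recall {level:.1f}: interpolated precision {precision:.3f}")
```

The uninterpolated points produce the familiar sawtooth shape, while the interpolated curve never increases as recall grows, which usually makes two systems easier to compare on one graph.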