Homework #1B [Due Mar 4]

[Thanks to James Allen and Jimmy Lin]

Please read Chapter 3 on Evaluation.

Problem 2 (20 points) [Continuing Problem 1]


In the second part of this assignment, for one of the two topics [YOUR CHOICE--make this clear], you will

  1. analyze agreement on the relevance judgments
  2. adjudicate the judgments
  3. use the adjudicated set to evaluate both MSNLive and Google

2.1. Agreement on Relevance Judgments


There should be judgments from N different people for each document (Web page). The first question you'll answer is: How often do judges agree on relevance? There are four possibilities:

For your chosen topic, figure out how often each case happens, both as a count and as a percentage, and turn this information in. Then pick two cases where the judgments for a particular document are not uniform, briefly speculate on why this may be so, and turn that in as well.
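
One way to get the counts and percentages is with a short script. The sketch below (Python, with made-up placeholder data) bins each document by unanimous, majority, and even-split agreement; that binning is an assumption, so adjust it to match the four cases as given.

# Tally how often each agreement case occurs for one topic.
# The (judge, docid, label) tuples below are made-up placeholder data;
# in practice, build this list from wherever you recorded the judgments.
from collections import defaultdict

judgments = [
    ("judge1", "doc1", "R"), ("judge2", "doc1", "R"), ("judge3", "doc1", "N"),
    ("judge1", "doc2", "N"), ("judge2", "doc2", "N"), ("judge3", "doc2", "N"),
]

labels = defaultdict(list)                 # docid -> list of R/N labels
for judge, docid, label in judgments:
    labels[docid].append(label)

counts = defaultdict(int)
for ls in labels.values():
    r, n = ls.count("R"), ls.count("N")
    if n == 0:
        case = "unanimous R"
    elif r == 0:
        case = "unanimous N"
    elif r > n:
        case = "majority R"
    elif n > r:
        case = "majority N"
    else:
        case = "even split"
    counts[case] += 1

for case, c in counts.items():
    print(f"{case}: {c} ({100.0 * c / len(labels):.1f}%)")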

2.2. Adjudication

Adjudication is the process of reconciling inconsistent judgments. Do this by simple majority voting; if the votes for a particular document are evenly split, pick a judgment at random. The result should look something like this:
G N Gardening for dummies
G R How to deal with wet soil conditions
...

Turn your adjudicated relevance judgments in.
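
A minimal Python sketch of the voting step, using the same placeholder representation as the sketch in 2.1 (docid -> list of R/N judgments); you would still prepend the engine tag and page title to match the format above:

# Adjudicate by simple majority vote, breaking equal splits randomly,
# as the assignment specifies. The input dictionary and the two document
# titles below are placeholders.
import random

def adjudicate(labels):
    verdicts = {}
    for docid, ls in labels.items():
        r, n = ls.count("R"), ls.count("N")
        if r != n:
            verdicts[docid] = "R" if r > n else "N"
        else:
            verdicts[docid] = random.choice("RN")   # equal split: pick at random
    return verdicts

print(adjudicate({"Gardening for dummies": ["N", "N", "R"],
                  "How to deal with wet soil conditions": ["R", "R", "N"]}))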

2.3. Evaluation of MSNLive and Google

Now, evaluate Live and Google using the adjudicated relevance judgments you just created (for the topic you chose). Issue the query to Google and Live again and examine the top 20 hits. Turn in the following information for both search systems:

In addition, answer the following questions:
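
Whatever the reported numbers are, precision over the top 20 hits is straightforward to compute from the adjudicated set. A minimal sketch, where every name is a placeholder for your own data:

# Precision at 20 for one engine, using the adjudicated judgments from 2.2.
# "adjudicated" maps docid -> "R"/"N"; "ranked" is an engine's result list
# in rank order. Both are placeholders for however you stored your data.
def precision_at_k(ranked, adjudicated, k=20):
    return sum(1 for d in ranked[:k] if adjudicated.get(d) == "R") / k

# Illustrative calls (google_top20 and live_top20 are assumed variables):
# precision_at_k(google_top20, adjudicated)
# precision_at_k(live_top20, adjudicated)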

Problem 3 (15 points)

3A. The following list of R's and N's represents relevant (R) and non-relevant (N) documents in a ranked list of 50 documents. The "top" of the ranked list is at the left: the leftmost entry is the most highly weighted document, the one the system believes is most likely to be relevant. The list runs across the page to the right and then continues on the next line. It shows 8 relevant documents. Assume there are an additional two (2) relevant documents that were not retrieved by this system.


R R N N N R N N N N        R N N N R N N N N R       R N N N N N N N R N
N N N N N N N N N N        N N N N N N N N N N


Based on that list, calculate the following measures (a Python sketch for checking your work appears after the list):

  1. Average precision
  2. Precision at 50% recall
  3. Precision at 33% recall (interpolated)
  4. Assuming the ordering is simple, search length for n=4
  5. Assuming a weak ordering with five values for ranking (resulting in the five groups of ten shown), what is the expected search length for n=4? That is, for this question, assume that the documents in the first group all have the same score, those in the second group share another score, and so on.
  6. What are the largest and smallest possible errors in average precision for the above system?
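
To check your hand calculations for measures (1)-(3), here is a minimal Python sketch of the standard definitions from Chapter 3; the search-length and error questions are left to you, since they hinge on the stated assumptions. The ranking string transcribes the list in 3A, and the total of ten relevant documents includes the two that were not retrieved.

# Measures for the ranked list in 3A. "R"/"N" encode the list top-first.
RANKING = list("RRNNNRNNNN" "RNNNRNNNNR" "RNNNNNNNRN" + "N" * 20)
TOTAL_RELEVANT = 10   # 8 retrieved + 2 not retrieved

def average_precision(ranking, total_relevant):
    # Mean of precision at each relevant retrieved document; unretrieved
    # relevant documents add 0 to the sum but still count in the denominator.
    hits, total = 0, 0.0
    for i, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            total += hits / i
    return total / total_relevant

def precision_at_recall(ranking, total_relevant, recall_level):
    # Uninterpolated: precision at the first rank where recall reaches the level.
    hits = 0
    for i, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            if hits / total_relevant >= recall_level:
                return hits / i
    return 0.0

def interpolated_precision(ranking, total_relevant, recall_level):
    # Interpolated: max precision at any rank whose recall is >= the level.
    hits, best = 0, 0.0
    for i, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
        if hits / total_relevant >= recall_level:
            best = max(best, hits / i)
    return best

print(average_precision(RANKING, TOTAL_RELEVANT))
print(precision_at_recall(RANKING, TOTAL_RELEVANT, 0.5))
print(interpolated_precision(RANKING, TOTAL_RELEVANT, 1/3))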

3B. Now imagine that another system retrieves the following ranked list for the same query.


R R R R N N N N N N        N N N N N N N N N N         N N N N N N N N N N
N N N N N N N N N N        N N N N N N N N N R

Repeat parts (3.A.1), (3.A.2), and (3.A.3) for the above ranked list. Then compare the two ranked lists on the basis of these three metrics: if you were given only these three numbers (average precision, precision at 50% recall, and precision at 33% recall), what could you determine about the relative performance of the two systems in general?


3C. Plot a recall/precision graph for the above two systems. Generate both an uninterpolated and an interpolated graph (probably as two separate graphs, to make the four plots easier to see). What do the graphs tell you about the systems in 3A and 3B?
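
If you prefer to script the plots rather than draw them by hand, here is a matplotlib sketch under the same assumptions as the sketch after 3A (rankings transcribed from the two lists, ten relevant documents for the query, interpolation computed only at the observed recall points):

# Recall/precision graphs for the systems in 3A and 3B (requires matplotlib).
import matplotlib.pyplot as plt

RANKING_A = list("RRNNNRNNNN" "RNNNRNNNNR" "RNNNNNNNRN" + "N" * 20)
RANKING_B = list("RRRR" + "N" * 45 + "R")
TOTAL_RELEVANT = 10

def recall_precision_points(ranking, total_relevant):
    # (recall, precision) at each rank where a relevant document appears.
    pts, hits = [], 0
    for i, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            pts.append((hits / total_relevant, hits / i))
    return pts

def interpolate(points):
    # Interpolated precision at recall r: max precision at any recall >= r.
    return [(r, max(p for _, p in points[j:])) for j, (r, _) in enumerate(points)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ranking, name in [(RANKING_A, "System A"), (RANKING_B, "System B")]:
    raw = recall_precision_points(ranking, TOTAL_RELEVANT)
    axes[0].plot(*zip(*raw), marker="o", label=name)
    axes[1].plot(*zip(*interpolate(raw)), marker="o", label=name)
axes[0].set_title("Uninterpolated")
axes[1].set_title("Interpolated")
for ax in axes:
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.legend()
plt.show()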