TREC: Improving Information Access through Evaluation (01 Nov 2005)
One objection to test collections that dates back to the Cranfield tests is the use of relevance judgments as the basis for evaluation. Relevance is known to be very idiosyncratic, and critics question how an evaluation methodology can be based on such an unstable foundation. An experiment using the TREC-4 and TREC-6 retrieval results investigated the effect of changing relevance assessors on system comparisons. The experiment demonstrated that the absolute scores for evaluation measures did change when different relevance assessors were used, but the relative scores between runs did not change. That is, if system A evaluated as better than system B using one set of judgments, then system A almost always evaluated as better than system B using a second set of judgments (the exception was in the case where the two runs evaluated as so similar to one another that they should be deemed equivalent). The stable comparisons result held for different evaluation measures and for different kinds of assessors and was independent of whether a judgment was based on a single assessor's opinion or was the consensus opinion of a majority of assessors.
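The "relative scores did not change" claim can be made concrete by comparing the ranking of runs induced by each set of judgments; Kendall's tau is a common measure of agreement between two such rankings. The sketch below is illustrative only: the run names and score values are invented, and the article does not specify the exact statistic used in the experiment.

```python
# Hypothetical sketch of an assessor-swap stability check: the same runs
# are scored under two independent sets of relevance judgments, and we ask
# whether the induced rankings agree. All run names and scores are invented.

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of runs."""
    runs = list(rank_a)
    concordant = discordant = 0
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            s, t = runs[i], runs[j]
            a = rank_a[s] - rank_a[t]
            b = rank_b[s] - rank_b[t]
            if a * b > 0:          # pair ordered the same way in both rankings
                concordant += 1
            elif a * b < 0:        # pair ordered oppositely
                discordant += 1
    pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / pairs

def to_ranks(scores):
    """Map each run to its rank position (0 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {run: rank for rank, run in enumerate(ordered)}

# Mean average precision for the same runs under two assessor sets:
# the absolute scores differ, but the ordering of the runs is identical.
map_assessor_1 = {"runA": 0.31, "runB": 0.27, "runC": 0.22, "runD": 0.18}
map_assessor_2 = {"runA": 0.25, "runB": 0.23, "runC": 0.17, "runD": 0.15}

tau = kendall_tau(to_ranks(map_assessor_1), to_ranks(map_assessor_2))
print(f"Kendall's tau between rankings: {tau:.2f}")  # 1.00: same ordering
```

A tau of 1.0 means the two judgment sets produce exactly the same ordering of runs; values near 1.0 indicate the kind of stability the experiment reports, with disagreements confined to runs whose scores are close enough to be considered equivalent.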
Article URL: http://www.asis.org/Bulletin/Oct-05/voorhees.html