Intelligent Topic Selection for Cost-Effective Information Retrieval Evaluation: A Fresh View on Deep vs. Shallow Judging
DOI: https://doi.org/10.70705/ppp.ir.2024.v02.i01.pp39-45
Keywords: Information retrieval (IR)
Abstract
Test collections are the foundation of Cranfield-style evaluation of information retrieval (IR) systems, but with today's massive document collections (e.g., ClueWeb12's 700M+ web pages), building test collections with traditional pooling techniques is practically infeasible. This has spurred a surge of research into more efficient and reliable methods of IR evaluation. In this study, we present a novel intelligent topic selection method that reduces the number of search topics, and thus the cost of the human relevance judgments, needed for reliable IR evaluation. To evaluate our method thoroughly, we bring together two previously separate lines of research: intelligent topic selection and deep vs. shallow judging, i.e., the question of whether it is more cost-effective to collect many relevance judgments for a few topics or fewer judgments for many topics. Prior work on intelligent topic selection has not been tested against shallow-judging baselines, while studies comparing deep and shallow judging have largely favored shallow judging, assuming that topics are selected at random. We argue that the decisive question in assessing any topic selection method is whether it is better to select topics intelligently or simply to judge many topics shallowly. To answer this general question rigorously, we conduct a comprehensive study across a range of relevant factors never before examined together: 1) how topics are selected; 2) how topic familiarity affects the speed of human judging; and 3) how different topic generation methods (requiring varying amounts of human effort) affect budget utilization and judgment quality. Experiments on the NIST TREC Robust 2003 and Robust 2004 test collections demonstrate that IR systems can be evaluated reliably with fewer topics. The results further show that 1) with intelligent topic selection, deep judging is more cost-effective than shallow judging in terms of evaluation reliability, and 2) topic familiarity and topic generation costs strongly influence the trade-off between evaluation cost and reliability. By demonstrating that deep judging often outperforms shallow judging when topics are selected intelligently, our findings challenge conventional wisdom.
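To make the deep vs. shallow trade-off concrete, the sketch below contrasts the two strategies under a fixed evaluation budget. All numbers (budget, per-judgment cost, per-topic generation cost, pool depths) are hypothetical placeholders for illustration, not figures reported in the paper.

```python
# Minimal sketch of the deep vs. shallow judging trade-off under a
# fixed evaluation budget. All costs and depths below are made-up
# placeholders, not values from the paper.

BUDGET = 10_000.0      # total budget (hypothetical, e.g., dollars)
JUDGMENT_COST = 1.0    # cost of one human relevance judgment (hypothetical)
TOPIC_GEN_COST = 50.0  # cost of creating one topic; varies by generation method

def affordable_topics(depth: int) -> int:
    """Number of topics affordable when each topic receives `depth` judgments."""
    per_topic_cost = TOPIC_GEN_COST + depth * JUDGMENT_COST
    return int(BUDGET // per_topic_cost)

# Shallow judging: few judgments per topic, many topics.
# Deep judging: many judgments per topic, few topics.
for depth in (10, 50, 100, 500):
    print(f"depth={depth:4d} judgments/topic -> {affordable_topics(depth):4d} topics")
```

Under these assumptions, shallow judging buys many topics and deep judging buys few; intelligent topic selection aims to make the small topic sets at the deep end of this spectrum as informative as possible, which is why topic generation cost and the selection strategy both enter the cost vs. reliability trade-off.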


