Document Language Models, Query Models, and Risk Minimization for Information Retrieval

John Lafferty; ChengXiang Zhai

doi:10.1145/3130348.3130375

Abstract

We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy, and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
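The idea of estimating one language model per document and one per query, then ranking by a risk function, can be illustrated with a minimal sketch. It assumes the loss is the KL divergence between the query model and the document model, and that document models are Dirichlet-smoothed unigram models. The function names, the smoothing parameter `mu`, and the toy documents are illustrative assumptions, not taken from this page. The query model here is a plain maximum-likelihood estimate, whereas the paper estimates it with Markov chains.

```python
import math
from collections import Counter


def doc_model(doc_tokens, collection_counts, collection_len, mu=2000.0):
    """Return p(w | d) as a Dirichlet-smoothed unigram model (mu is an assumed value)."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(word):
        p_collection = collection_counts[word] / collection_len
        return (counts[word] + mu * p_collection) / (doc_len + mu)

    return prob


def query_model(query_tokens):
    """Maximum-likelihood query model p(w | q); a stand-in for a richer estimate."""
    counts = Counter(query_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def neg_cross_entropy(q_model, p_doc):
    """Sum of q(w) * log p(w | d); ranks documents the same way as -KL(q || d)."""
    return sum(q * math.log(p_doc(word)) for word, q in q_model.items())


def rank(query_tokens, docs):
    """Rank document ids by decreasing score (lower KL divergence first)."""
    collection_counts = Counter(w for tokens in docs.values() for w in tokens)
    collection_len = sum(collection_counts.values())
    q_model = query_model(query_tokens)
    # Drop query words unseen in the collection: their smoothed probability is 0.
    q_model = {w: p for w, p in q_model.items() if collection_counts[w] > 0}
    scores = {
        doc_id: neg_cross_entropy(
            q_model, doc_model(tokens, collection_counts, collection_len)
        )
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    toy_docs = {
        "d1": "language model retrieval".split(),
        "d2": "cooking pasta recipes".split(),
    }
    print(rank(["language", "retrieval"], toy_docs))
```

Because the entropy of the query model is the same for every document, sorting by the negative cross-entropy gives the same order as sorting by negative KL divergence, which is why the sketch omits that term.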

Keywords

Computer science; Query expansion; Query language; Language model; Ranking (information retrieval); Information retrieval; Web query classification; Query optimization; RDF query language; Divergence-from-randomness model; Web search query; Vector space model; Artificial intelligence; Natural language processing; Probabilistic logic; Data mining; Search engine

Affiliated Institutions

Carnegie Mellon University US

Publication Info

Year: 2017
Type: article
Volume: 51
Issue: 2
Pages: 251-259
Citations: 772
Access: Closed

Cite This

APA Style

John Lafferty, ChengXiang Zhai (2017). Document Language Models, Query Models, and Risk Minimization for Information Retrieval. ACM SIGIR Forum, 51(2), 251-259. https://doi.org/10.1145/3130348.3130375
