Abstract

We present an application of Hidden Markov Models (HMMs) to supervised document classification and ranking. We consider a family of models that account for the fact that relevant documents may contain irrelevant passages. The originality of the model is that it does not explicitly segment documents; instead, its final score takes all possible segmentations into account. The model generalizes multinomial Naive Bayes and is derived from a more general model for different access tasks. We evaluate it on the REUTERS test collection and compare it with the multinomial Naive Bayes model. It is shown to be more robust with respect to training set size and to improve performance for both ranking and classification, especially for classes with few training examples.
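
The idea of scoring a document over all possible segmentations, rather than committing to one, can be illustrated with the standard HMM forward algorithm. The following is a minimal sketch and not the authors' exact model: it assumes a two-state HMM whose states (hypothetically named "rel" and "irr") emit words from a relevant and an irrelevant multinomial, and every probability, word, and function name below is invented for illustration. When the irrelevant state is unreachable, the score reduces to the multinomial Naive Bayes log-likelihood, which is one simple way to see the generalization claimed above.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def nb_log_likelihood(words, emit_rel):
    """Multinomial Naive Bayes: every word is drawn from the relevant model."""
    return sum(safe_log(emit_rel[w]) for w in words)

def forward_log_likelihood(words, start, trans, emit):
    """log P(words) under the HMM, summed over all state sequences.

    Each state sequence corresponds to one segmentation of the document into
    relevant and irrelevant passages, so no explicit segmentation is chosen.
    """
    states = list(start)
    # alpha[s] = log P(w_1..w_t, state at position t = s)
    alpha = {s: safe_log(start[s]) + safe_log(emit[s][words[0]]) for s in states}
    for w in words[1:]:
        alpha = {
            s: log_sum_exp([alpha[r] + safe_log(trans[r][s]) for r in states])
            + safe_log(emit[s][w])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))

# Hypothetical toy parameters (assumptions, not values from the paper).
EMIT = {
    "rel": {"hmm": 0.5, "model": 0.4, "weather": 0.1},
    "irr": {"hmm": 0.1, "model": 0.1, "weather": 0.8},
}
START = {"rel": 0.9, "irr": 0.1}
TRANS = {"rel": {"rel": 0.8, "irr": 0.2}, "irr": {"rel": 0.2, "irr": 0.8}}

# Degenerate HMM that never leaves the relevant state: equivalent to Naive Bayes.
NB_START = {"rel": 1.0, "irr": 0.0}
NB_TRANS = {"rel": {"rel": 1.0, "irr": 0.0}, "irr": {"rel": 1.0, "irr": 0.0}}

# A relevant document that ends with an off-topic passage.
DOC = ["hmm", "model", "weather", "weather"]

hmm_score = forward_log_likelihood(DOC, START, TRANS, EMIT)
nb_score = nb_log_likelihood(DOC, EMIT["rel"])
```

On this toy document the passage model assigns a higher likelihood than Naive Bayes, because the trailing off-topic words can be explained by the irrelevant state instead of penalizing the relevant model. In a classifier, one such score would be computed per class and the results compared or ranked.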

Keywords

Ranking (information retrieval), Hidden Markov model, Computer science, Multinomial distribution, Artificial intelligence, Naive Bayes classifier, Machine learning, Set (abstract data type), Document classification, Training set, Bayes' theorem, Bayesian probability, Pattern recognition (psychology), Natural language processing, Mathematics, Statistics, Support vector machine

Publication Info

Year: 2001
Type: preprint
Citations: 40
Access: Closed

Citation Metrics

OpenAlex: 40

Cite This

Ludovic Denoyer, Hugo Zaragoza, Patrick Gallinari (2001). HMM-based passage models for document classification and ranking.