Abstract
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated: term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary redu...
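As an illustration of two of the criteria named in the abstract, the following is a minimal sketch of DF thresholding and of the χ² statistic for a term/category pair, computed from a 2x2 contingency table. The toy corpus, the category names, and the function names (`document_frequency`, `chi_square`, `select_by_df`) are assumptions made for this example; they are not taken from the paper's experiments.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens with one label.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export"],
    ["bank", "rate", "rise"],
    ["bank", "loan"],
]
labels = ["grain", "grain", "money", "money"]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def select_by_df(docs, keep):
    """Keep the `keep` terms with the highest DF (ties broken alphabetically)."""
    df = document_frequency(docs)
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


def chi_square(docs, labels, term, category):
    """Chi-square statistic for one term/category pair.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


top_terms = select_by_df(docs, 3)
wheat_score = chi_square(docs, labels, "wheat", "grain")
rise_score = chi_square(docs, labels, "rise", "grain")
```

In this toy data, "wheat" occurs only in the "grain" documents and so receives a high χ² score, while "rise" is spread evenly across both categories and scores zero. DF thresholding needs only a single counting pass and no category labels, which is why it is the cheapest of the criteria compared.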
Publication Info
- Year: 1997
- Type: article
- Pages: 412-420
- Citations: 4766
- Access: Closed