Abstract
Hierarchical text classification, or simply hierarchical classification, refers to assigning a document to one or more suitable categories from a hierarchical category space. Our literature survey found that existing hierarchical classification experiments used a variety of measures to evaluate performance. These measures often assume independence between categories and do not consider documents misclassified into categories that are similar to, or not far from, the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed measures consist of category-similarity measures and distance-based measures that take the contributions of misclassified documents into account. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naïve Bayes classifiers showed that SVM classifiers perform better than Naïve Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure, called the blocking measure, is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
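To make the idea of a distance-based measure concrete, the sketch below gives a misclassified document partial credit that shrinks as the predicted category moves further from the correct one in the category tree. It is an illustration only, not the formula defined in the paper: the toy tree, the function names, the linear credit rule, the clamping at zero, and the `max_dist` threshold are all assumptions introduced here.

```python
# Hypothetical toy category tree, stored as child -> parent links.
# The category names are illustrative and not taken from the paper.
parent = {
    "wheat": "grain",
    "corn": "grain",
    "grain": "root",
    "gold": "metals",
    "metals": "root",
}


def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(parent, a, b):
    """Number of edges on the path between categories a and b."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(parent, a))}
    for j, n in enumerate(path_to_root(parent, b)):
        if n in steps_from_a:
            # n is the lowest common ancestor of a and b.
            return steps_from_a[n] + j
    raise ValueError("categories are not in the same tree")


def partial_credit(parent, predicted, actual, max_dist=3):
    """Assumed credit rule: 1.0 for an exact match, falling linearly
    to 0.0 once the tree distance reaches max_dist."""
    d = tree_distance(parent, predicted, actual)
    return max(0.0, 1.0 - d / max_dist)


if __name__ == "__main__":
    for predicted in ("wheat", "grain", "corn", "gold"):
        credit = partial_credit(parent, predicted, "wheat")
        print(predicted, round(credit, 3))
```

Under this assumed rule, a document that belongs to `wheat` but is assigned to its parent `grain` or sibling `corn` still earns some credit, whereas one assigned to the distant `gold` earns none. A measure that treats categories as independent would score all three errors identically.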
Related Publications
Hierarchical classification of Web content
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train dif...
Enhanced hypertext categorization using hyperlinks
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword a...
HMM-based passage models for document classification and ranking
We present an application of Hidden Markov Models to supervised document classification and ranking. We consider a family of models that take into account the fact that relevant...
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mos...
Using Maximum Entropy for Text Classification
This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety o...
Publication Info
- Year: 2003
- Type: article
- Volume: 54
- Issue: 11
- Pages: 1014-1028
- Citations: 70
- Access: Closed
Identifiers
- DOI: 10.1002/asi.10298