Abstract
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. An accurate classifier is therefore an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost on a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
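The abstract names relaxation labeling over a small link neighborhood but gives no details of the model. The following is only a toy sketch of the general idea, not the authors' statistical model: each document starts from a text-only class estimate, and its belief is repeatedly re-weighted by how compatible each class is with the current beliefs of its linked neighbors. The function name `relaxation_label`, the compatibility table `compat`, and all numbers are hypothetical and chosen for illustration.

```python
def relaxation_label(text_probs, links, known, compat, iterations=10):
    """Toy relaxation labeling on a link graph (illustrative assumption only).

    text_probs: doc -> {class: probability from a text-only classifier}
    links:      doc -> list of neighboring docs
    known:      doc -> class, for neighbors whose topic is already known
    compat:     class -> {neighbor class: compatibility weight > 0}
    """
    classes = list(compat)
    # Known documents are pinned to their label; others start from the text estimate.
    beliefs = {}
    for doc, probs in text_probs.items():
        if doc in known:
            beliefs[doc] = {c: 1.0 if c == known[doc] else 0.0 for c in classes}
        else:
            beliefs[doc] = dict(probs)

    for _ in range(iterations):
        updated = {}
        for doc in text_probs:
            if doc in known:
                updated[doc] = beliefs[doc]
                continue
            scores = {}
            for c in classes:
                score = text_probs[doc][c]
                # Weight by the expected compatibility with each neighbor's belief.
                for nb in links.get(doc, ()):
                    score *= sum(compat[c][c2] * beliefs[nb][c2] for c2 in classes)
                scores[c] = score
            total = sum(scores.values())
            updated[doc] = {c: s / total for c, s in scores.items()}
        beliefs = updated  # synchronous update across all documents
    return beliefs


# Hypothetical data: "b" and "c" have known topics, "a" and "d" do not.
compat = {"db": {"db": 0.8, "ai": 0.2}, "ai": {"db": 0.2, "ai": 0.8}}
text_probs = {
    "a": {"db": 0.45, "ai": 0.55},  # text alone leans slightly toward "ai"
    "b": {"db": 0.5, "ai": 0.5},
    "c": {"db": 0.5, "ai": 0.5},
    "d": {"db": 0.5, "ai": 0.5},
}
links = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
known = {"b": "db", "c": "db"}

for doc, belief in relaxation_label(text_probs, links, known, compat).items():
    print(doc, max(belief, key=belief.get))
```

In this toy setup the two known neighbors pull document "a" away from its text-only guess, and "d" in turn follows "a", which mirrors the abstract's point that the method can use whatever fraction of neighbors has known topics.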
Related Publications
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set g_k of n word stems, and we say that g_k occurs in a d...
Machine learning in automated text categorization
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of ...
Subject and citation indexing. Part II: The optimal, cluster-based retrieval performance of composite representations
Measures of cluster-based retrieval effectiveness are computed for five composite representations in the cystic fibrosis (CF) Document Collection. The composite representations ...
Searching for information in a hypertext medical handbook
Medicine is an ideal domain for hypertext applications and research. Implementing a popular medical handbook in hypertext underscores the need to study hypertext in the context ...
Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques
We present sentiment analyzer (SA) that extracts sentiment (or opinion) about a subject from online text documents. Instead of classifying the sentiment of an entire document ab...
Publication Info
- Year: 1998
- Type: article
- Pages: 307-318
- Citations: 775
- Access: Closed
Identifiers
- DOI: 10.1145/276304.276332