Abstract

WHIRL is an extension of relational databases that can perform “soft joins” based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalence of atomic values. This paper evaluates WHIRL on a number of inductive classification tasks using data from the World Wide Web. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature inductive classification systems on these tasks. In particular, WHIRL generally achieves lower generalization error than C4.5, RIPPER, and several nearest-neighbor methods. WHIRL is also fast: up to 500 times faster than C4.5 on some benchmark problems. We also show that WHIRL can be used efficiently to select, from a large pool of unlabeled items, those that can be classified correctly with high confidence.
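The idea of a soft join, and of classifying with one, can be illustrated with a small sketch. It is not WHIRL's implementation: it assumes TF-IDF weighted cosine similarity as the similarity measure (the abstract does not specify one), uses a plain similarity-weighted vote to combine neighbors, and all function names (`tfidf_vectors`, `soft_join`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems also stem)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right, k=1):
    """For each left text, return up to k (right_index, score) pairs, best first.

    An exact join would keep only pairs with identical values; here pairs are
    ranked by textual similarity and zero-similarity pairs are dropped.
    """
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    joined = []
    for u in left_vecs:
        scored = sorted(((cosine(u, v), j) for j, v in enumerate(right_vecs)),
                        reverse=True)
        joined.append([(j, s) for s, j in scored[:k] if s > 0])
    return joined


def classify(text, labeled, k=3):
    """Soft-join one unlabeled text against labeled (text, label) examples.

    Returns (label, vote_weight); vote_weight can serve as a rough confidence
    score. Returns (None, 0.0) if nothing is similar.
    """
    matches = soft_join([text], [t for t, _ in labeled], k=k)[0]
    votes = defaultdict(float)
    for j, score in matches:
        votes[labeled[j][1]] += score
    if not votes:
        return None, 0.0
    best = max(votes, key=votes.get)
    return best, votes[best]
```

With this sketch, `soft_join(["Intl Business Machines Corp"], ["International Business Machines", "Apple Computer Inc"])` pairs the first string with index 0 even though the two names are not equal, which is the kind of match an exact join would miss. Sorting unlabeled items by the returned vote weight gives one simple way to pick those classified with the highest confidence.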

Keywords

Joins, Computer science, Generalization, Identifier, Similarity (geometry), Artificial intelligence, Benchmark (surveying), Data mining, Equivalence (formal languages), Mathematics

Publication Info

Year: 1998
Type: article
Pages: 169-173
Citations: 105
Access: Closed

Cite This

William W. Cohen, Haym Hirsh (1998). Joins that generalize: text classification using WHIRL, pp. 169-173.