Abstract
We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words, and authors) based on pairwise interactions between the types, as observed in co-occurrence data. The scheme generates multiple clustering systems that aim to maximize an objective function measuring multiple pairwise mutual information terms between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables with bottom-up clustering of the others, together with a local optimization correction routine. Focusing on document clustering, we present an extensive empirical study of two-way, three-way, and four-way applications of our scheme on six real-world datasets, including 20 Newsgroups (20NG) and the Enron email collection. Our multi-way distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information-theoretic clustering algorithms.
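To make the objective in the abstract concrete, the following is a minimal sketch of how mutual information between two cluster variables can be computed from a co-occurrence table, and how several such terms might be combined. The function names, the toy document-by-word table, and the unweighted sum over pairs are all illustrative assumptions; the abstract does not specify the exact form or weighting of the objective, and this sketch does not implement the MDC algorithm itself.

```python
from math import log2


def cluster_mutual_information(counts, row_clusters, col_clusters):
    """Mutual information (in bits) between row-cluster and column-cluster variables.

    counts: co-occurrence table as a list of rows (e.g. documents x words).
    row_clusters / col_clusters: hypothetical cluster label for each row / column.
    """
    total = sum(sum(row) for row in counts)

    # Aggregate the empirical joint distribution at the cluster level.
    joint = {}
    for i, row in enumerate(counts):
        for j, count in enumerate(row):
            if count:
                key = (row_clusters[i], col_clusters[j])
                joint[key] = joint.get(key, 0.0) + count / total

    # Marginals of the two cluster variables.
    row_marg, col_marg = {}, {}
    for (r, c), p in joint.items():
        row_marg[r] = row_marg.get(r, 0.0) + p
        col_marg[c] = col_marg.get(c, 0.0) + p

    return sum(p * log2(p / (row_marg[r] * col_marg[c]))
               for (r, c), p in joint.items())


def pairwise_objective(tables):
    """Assumed objective: an unweighted sum of cluster-level mutual information
    over every observed pair of variable types."""
    return sum(cluster_mutual_information(counts, rows, cols)
               for counts, rows, cols in tables)


# Toy document-by-word counts with two clear topical blocks (made-up data).
doc_word = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]
word_clusters = [0, 0, 1, 1]

aligned = cluster_mutual_information(doc_word, [0, 0, 1, 1], word_clusters)
mixed = cluster_mutual_information(doc_word, [0, 1, 0, 1], word_clusters)
print(aligned, mixed)
```

In this toy table, document clusters that follow the block structure preserve more information about the word clusters than clusters that mix the two blocks, which is the kind of preference a mutual-information objective encodes. A three-way setting could pass several `(counts, row_clusters, col_clusters)` triples to `pairwise_objective`, one per observed pair of types.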
Publication Info
- Year: 2005
- Type: article
- Pages: 41-48
- Citations: 104
- Access: Closed
Identifiers
- DOI: 10.1145/1102351.1102357