CLOUDS: a decision tree classifier for large datasets

Abstract

Classification of very large datasets has many practical applications in data mining. Techniques such as discretization and dataset sampling can be used to scale up decision tree classifiers to large datasets. Unfortunately, both of these techniques can cause a significant loss in accuracy. We present a novel decision tree classifier called CLOUDS, which samples the splitting points for numeric attributes and then applies an estimation step to narrow the search space for the best split. CLOUDS substantially reduces computation and I/O complexity compared to state-of-the-art classifiers, while maintaining the quality of the generated trees in terms of accuracy and tree size. We provide experimental results on a number of real and synthetic datasets.
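The two-phase idea in the abstract (evaluate a sampled set of candidate splitting points, then narrow the search around the most promising one) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the gini index as the split criterion, takes interval boundaries from the quantiles of a random sample, and replaces the estimation step, which the abstract does not specify, with a simpler rule: exhaustively search only the two intervals next to the best sampled boundary. The names `find_split`, `sampled_boundaries`, `q`, and `sample_size` are hypothetical.

```python
import random
from collections import Counter


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(values, labels, threshold):
    """Weighted gini of the binary split `x <= threshold` vs `x > threshold`."""
    left = Counter(y for x, y in zip(values, labels) if x <= threshold)
    right = Counter(y for x, y in zip(values, labels) if x > threshold)
    n_left, n_right = sum(left.values()), sum(right.values())
    return (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)


def sampled_boundaries(values, q, sample_size, rng):
    """Up to q-1 distinct boundaries taken from the quantiles of a random sample."""
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    return sorted({sample[(i * len(sample)) // q] for i in range(1, q)})


def find_split(values, labels, q=10, sample_size=200, seed=0):
    """Return (threshold, weighted gini) for one numeric attribute."""
    rng = random.Random(seed)
    bounds = sampled_boundaries(values, q, sample_size, rng)

    # Phase 1: evaluate the criterion only at the sampled boundaries.
    best_gini, best_threshold = min(
        (split_gini(values, labels, b), b) for b in bounds
    )

    # Phase 2 (assumed simplification of the estimation step): search every
    # distinct value inside the two intervals adjacent to the best boundary.
    k = bounds.index(best_threshold)
    lo = bounds[k - 1] if k > 0 else float("-inf")
    hi = bounds[k + 1] if k + 1 < len(bounds) else float("inf")
    for x in sorted({v for v in values if lo < v < hi}):
        g = split_gini(values, labels, x)
        if g < best_gini:
            best_gini, best_threshold = g, x
    return best_threshold, best_gini


# Synthetic data: the class is fully determined by whether x exceeds 0.37.
data_rng = random.Random(42)
values = [data_rng.random() for _ in range(1000)]
labels = [int(x > 0.37) for x in values]
threshold, impurity = find_split(values, labels)
print(threshold, impurity)
```

For clarity the sketch rescans the data for every candidate threshold. A practical implementation would instead accumulate class counts per interval in a single pass, which is where the savings in computation and I/O would come from.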

Keywords

Decision tree, Computer science, Classifier (UML), Decision tree learning, Computation, Data mining, Incremental decision tree, Artificial intelligence, Tree (set theory), Machine learning, Pattern recognition (psychology), Logistic model tree, Algorithm, Mathematics

Publication Info

Year: 1998
Type: article
Pages: 2-8
Citations: 148
Access: Closed

Citation Metrics

148 citations (OpenAlex)

Cite This

APA Style

Khaled Alsabti, Sanjay Ranka, Vineet Kumar Singh (1998). CLOUDS: a decision tree classifier for large datasets. Syracuse University Libraries (Syracuse University), 2-8.