Abstract
Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called "random forests". That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured by an interesting feature of the benchmarking datasets extracted from the UCI machine learning repository for testing: all of these datasets are hard to overfit using tree-structured methods. This raises questions about the scope of the repository.

With this as motivation, and coupled with experience from boosting methods, we revisit the formulation of random forests and investigate prediction performance on real-world and simulated datasets for which maximally sized trees do overfit. These explorations reveal that gains can be realized by additional tuning that regulates tree size, via limiting the number of splits and/or the size of nodes for which splitting is allowed. Nonetheless, even in these settings, good performance for random forests can be attained by using larger (than default) values of the primary tuning parameter.
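To make the tuning knobs mentioned in the abstract concrete, the following is a minimal illustrative sketch, not taken from the paper: it uses scikit-learn's RandomForestRegressor as a stand-in for the R randomForest package used in the study, with max_features playing the role of the primary tuning parameter (mtry), and min_samples_leaf / max_leaf_nodes serving as analogues of restricting node size and the number of splits. The simulated Friedman data is likewise just a placeholder for a dataset on which maximally sized trees can overfit.

```python
# Illustrative sketch only: compares default (maximal) trees, size-regulated
# trees, and a larger-than-default max_features (mtry analogue).
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Simulated regression data (placeholder for the paper's datasets).
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

configs = {
    "default (maximal trees)": RandomForestRegressor(
        n_estimators=500, random_state=0),
    "regulated tree size": RandomForestRegressor(
        n_estimators=500, min_samples_leaf=5, max_leaf_nodes=32,
        random_state=0),
    "larger max_features (mtry)": RandomForestRegressor(
        n_estimators=500, max_features=0.8, random_state=0),
}

for name, rf in configs.items():
    mse = -cross_val_score(rf, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.3f}")
```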
Publication Info
- Year: 2004
- Type: article
- Citations: 682
- Access: Closed