Abstract

This article studies techniques for reducing the bias of risk estimates, both globally and within terminal nodes of CART® classification trees. In Section 5.4 of Classification and Regression Trees, Leo Breiman presented an estimator that has two free parameters, together with an empirical Bayes method for estimating them. Here we explain why the estimator should be successful in the many examples for which it is. We give numerical evidence from simulations in the two-class case, with attention to ordinary resubstitution and seven other methods of estimation. There are 14 sampling distributions: 13 are simulated, and the remaining one concerns E. coli promoter regions. We report on varying the minimum node sizes of the trees; the prior probabilities and misclassification costs; and, when relevant, the numbers of bootstraps or cross-validations. A variation of Breiman's method in which repeated cross-validation is employed to estimate global rates of misclassification was the most accurate of the eight methods. Exceptions are cases for which the Bayes risk of the Bayes rule is small. For them, either a local bootstrap .632 estimate or Breiman's method modified to use a bootstrap estimate of the global misclassification rate is most accurate.
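As an illustration of two of the estimators the abstract names, the sketch below contrasts ordinary resubstitution with a bootstrap .632 estimate of the global misclassification rate. It is not the authors' code and rests on several assumptions: a one-split "stump" on a single feature stands in for a CART tree, the data are synthetic, the out-of-bag error is averaged per bootstrap replicate (a simplification of the per-observation version), and the estimate is global rather than local to a terminal node. All function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split rule on 1-D data: predict `left` if x <= t, else `right`.
    A stand-in for a tree; chosen by minimizing training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def error_rate(rule, xs, ys):
    """Fraction of cases that `rule` misclassifies."""
    return sum(rule(x) != y for x, y in zip(xs, ys)) / len(xs)


def resubstitution(xs, ys):
    """Error of the rule on the same data used to fit it (optimistically biased)."""
    return error_rate(fit_stump(xs, ys), xs, ys)


def bootstrap_632(xs, ys, n_boot=50, seed=0):
    """0.368 * resubstitution error + 0.632 * average out-of-bag error."""
    rng = random.Random(seed)
    n = len(xs)
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n cases with replacement
        chosen = set(idx)
        held_out = [i for i in range(n) if i not in chosen]
        if not held_out:
            continue  # no out-of-bag cases in this replicate
        rule = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        oob_errors.append(
            error_rate(rule, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * resubstitution(xs, ys) + 0.632 * oob


# Synthetic two-class data (an assumption): unit-variance Gaussians with means 0 and 1.
data_rng = random.Random(1)
ys = [i % 2 for i in range(100)]
xs = [data_rng.gauss(float(y), 1.0) for y in ys]

resub = resubstitution(xs, ys)
est_632 = bootstrap_632(xs, ys)
print(f"resubstitution: {resub:.3f}  bootstrap .632: {est_632:.3f}")
```

The weights 0.368 and 0.632 reflect the approximate probability, 1 - 1/e, that a given case appears in a bootstrap sample of the same size as the data.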

Keywords

Estimator, Bayes' theorem, Statistics, Computer science, Naive Bayes classifier, Mathematics, Estimation, Bayes classifier, Bayes error rate, Artificial intelligence, Bayesian probability, Support vector machine

Publication Info

Year: 2002
Type: article
Volume: 11
Issue: 2
Pages: 263-288
Citations: 16
Access: Closed

Cite This

D. Blöch, Richard A. Olshen, Michael G. Walker (2002). Risk Estimation for Classification Trees. Journal of Computational and Graphical Statistics, 11(2), 263-288. https://doi.org/10.1198/106186002760180509

Identifiers

DOI: 10.1198/106186002760180509