Abstract
When a regression problem contains many predictor variables, it is rarely wise to try to fit the data by means of a least squares regression on all of the predictor variables. Usually, a regression equation based on a few variables will be more accurate and certainly simpler. There are various methods for picking "good" subsets of variables, and programs that carry out such procedures are part of every widely used statistical package. The most common methods are based on stepwise addition or deletion of variables and on "best subsets." The latter refers to a search method that, given the number of variables to be in the equation (say, five), locates the regression equation based on five variables that has the lowest residual sum of squares among all five-variable equations. All of these procedures generate a sequence of regression equations, the first based on one variable, the next based on two variables, and so on. Each member of this sequence is called a submodel, and the number of variables in the equation is the dimensionality of the submodel. A difficult problem is determining which submodel of the generated sequence to select. Statistical packages use various ad hoc selection methods, including F to enter, F to delete, Cp, and t-value cutoffs. Our approach to this problem is through the criterion that a good selection procedure selects dimensionality so as to give low prediction error (PE), where the PE of a regression equation is its expected squared error over the points in the X design. Because the true PE is unknown, the use of this criterion must be based on PE estimates. We introduce a method called the little bootstrap, which gives almost unbiased estimates of submodel PEs and then uses these to do submodel selection. Comparison is made to Cp and other methods by analytic examples and simulations. The little bootstrap does well; Cp and, by implication, all selection methods not based on data reuse give highly biased results and poor subset selection.
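To make the abstract's two estimates concrete, here is a minimal Python sketch (not code from the paper) contrasting a Cp-style PE estimate with a little-bootstrap PE estimate for best-subsets selection. The function names, the noise scale `t`, and the replicate count `reps` are illustrative assumptions; `sigma` would in practice be estimated, e.g., from the full least squares fit.

```python
# Sketch: best-subsets selection scored by a Cp-style penalty vs. a
# little-bootstrap PE estimate. Illustrative, not the paper's exact recipe.
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustive search: the k-column OLS fit with the lowest RSS."""
    best_rss, best_fit = np.inf, None
    for cols in combinations(range(X.shape[1]), k):
        Xk = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        fit = Xk @ beta
        rss = float((y - fit) @ (y - fit))
        if rss < best_rss:
            best_rss, best_fit = rss, fit
    return best_rss, best_fit

def cp_pe_estimate(rss_k, k, sigma2):
    """Cp-style estimate: RSS plus a fixed penalty 2*k*sigma^2.
    The penalty ignores the search over subsets, which is why the
    abstract reports that Cp is badly biased for selected submodels."""
    return rss_k + 2.0 * k * sigma2

def little_bootstrap_pe(X, y, k, sigma, t=0.6, reps=25, seed=0):
    """Little-bootstrap estimate (sketch): perturb y with small noise
    t*sigma*z, rerun the entire selection procedure, and use the
    covariance between the noise and the refitted values to estimate
    the optimism of RSS, selection step included."""
    rng = np.random.default_rng(seed)
    rss_k, _ = best_subset(X, y, k)
    cov_hat = 0.0
    for _ in range(reps):
        z = rng.standard_normal(len(y))
        _, fit_tilde = best_subset(X, y + t * sigma * z, k)  # selection rerun
        cov_hat += (sigma / t) * float(z @ fit_tilde) / reps
    return rss_k + 2.0 * cov_hat
```

Under this sketch, one would compute the estimate for each dimensionality k and select the submodel with the smallest estimated PE; the little bootstrap's penalty grows when the selection step overfits, whereas the Cp penalty stays fixed at 2*k*sigma^2.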
Related Publications
Partial least squares regression and projection on latent structure regression (PLS Regression)
Partial least squares (PLS) regression (a.k.a. projection on latent structures) is a recent technique that combines features from and generalizes principal component a...
Collinearity: a review of methods to deal with it and a simulation study evaluating their performance
Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can ...
Consistent Partial Least Squares for Nonlinear Structural Equation Models
Partial Least Squares as applied to models with latent variables, measured indirectly by indicators, is well-known to be inconsistent. The linear compounds of indicators that PL...
Criteria for Selection of a Subset Regression: Which One Should Be Used?
The problem of selecting the best subset of predictor variables in a linear regression model has led to the development of a number of criteria for choosing between con...
Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR
Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In thi...
Publication Info
- Year: 1992
- Type: article
- Volume: 87
- Issue: 419
- Pages: 738-754
- Citations: 272
- Access: Closed
Identifiers
- DOI: 10.1080/01621459.1992.10475276