I think this approach is mistaken, but perhaps it will be more helpful if I explain why. Wanting to know the best model given some information about a large number of variables is quite understandable. Moreover, it is a situation in which people seem to find themselves regularly. In addition, many textbooks (and courses) on regression cover stepwise selection methods, which implies that they must be legitimate. Unfortunately, however, they are not, and this pairing of situation and goal is quite difficult to navigate successfully. The following is a list of problems with automated stepwise model selection procedures (attributed to Frank Harrell, and copied from here):

- It yields R-squared values that are badly biased to be high.
- The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
- The method yields confidence intervals for effects and predicted values that are falsely narrow; see Altman and Andersen (1989).
- It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
- It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani [1996]).
- It has severe problems in the presence of collinearity.
- It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
- Increasing the sample size does not help very much; see Derksen and Keselman (1992).
- It allows us to not think about the problem.
- It uses a lot of paper.

The question is: what is so bad about these procedures, and why do these problems occur? Most people who have taken a basic regression course are familiar with the concept of regression to the mean, so this is what I use to explain these issues. (Although this may seem off-topic at first, bear with me, I promise it's relevant.)

Imagine a high school track coach on the first day of tryouts. Thirty kids show up. These kids have some underlying level of intrinsic ability to which neither the coach, nor anyone else, has direct access. As a result, the coach does the only thing he can do, which is have them all run a 100m dash. The times are presumably a measure of their intrinsic ability and are taken as such. However, they are probabilistic; some proportion of how well someone does is based on their actual ability and some proportion is random. Imagine that the true situation is the following:

```
set.seed(59)
# Each runner's true ability, which no one can observe directly
intrinsic_ability = runif(30, min=9, max=10)
# Observed 100m time: a deterministic function of ability plus random noise
time = 31 - 2*intrinsic_ability + rnorm(30, mean=0, sd=.5)
```

The results of the first race are displayed in the following figure along with the coach's comments to the kids.

Note that partitioning the kids by their race times leaves overlap in their intrinsic abilities--this fact is crucial. After praising some, and yelling at some others (as coaches tend to do), he has them run again. Here are the results of the second race, with the coach's reactions (simulated from the same model above):
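For concreteness, the second race can be generated by reusing the same abilities and drawing fresh noise (this snippet is my reconstruction of "the same model above", not part of the original answer):

```
# Same runners, same intrinsic abilities - only the random component changes
time2 = 31 - 2*intrinsic_ability + rnorm(30, mean=0, sd=.5)
```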

Notice that their intrinsic ability is identical, but the times bounced around relative to the first race. From the coach's point of view, those he yelled at tended to improve, and those he praised tended to do worse (I adapted this concrete example from the Kahneman quote listed on the wiki page). In reality, regression to the mean is a simple mathematical consequence of the fact that the coach is selecting athletes for the team based on a measurement that is partly random.

Now, what does this have to do with automated (e.g., stepwise) model selection techniques? Developing and confirming a model based on the same dataset is sometimes called *data dredging*. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores (e.g., higher t-statistics), these are random variables and the realized values contain error. Thus, when you select variables based on having higher (or lower) realized values, they may be such because of their underlying true value, error, or both. If you proceed in this manner, you will be as surprised as the coach was after the second race. This is true whether you select variables based on having high t-statistics, or low intercorrelations. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else). Unfortunately, this is just a problem intrinsic to the epistemic nature of reality itself.
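To see this concretely, here is a small simulation (a sketch of my own, not from the original answer): the response is pure noise, unrelated to any of the 20 candidate predictors, yet backward stepwise selection by AIC will typically retain several predictors and report a nonzero R-squared, because it is selecting on realized values that are partly random.

```
set.seed(1)
n = 100; p = 20
# 20 candidate predictors and a response, all independent standard normals
dat = as.data.frame(matrix(rnorm(n*(p+1)), n, p+1))
names(dat)[p+1] = "y"   # y is just noise; no predictor is truly related to it
full = lm(y ~ ., data = dat)
chosen = step(full, direction = "backward", trace = 0)
# How many predictors survived, and the R-squared of the "selected" model
length(coef(chosen)) - 1
summary(chosen)$r.squared
```

Running this repeatedly with different seeds shows the same pattern: the selected set changes from run to run, but it is rarely empty, even though the true model contains nothing.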

I hope this is helpful.

What about bootStepAIC Package in R? – None – 2015-04-22T19:02:31.907

Frankly, I think this is a disastrous idea, just about guaranteed to lead to many false conclusions. – gung – 2012-01-09T18:30:54.630

@gung: while I agree that blindly following the result of a model selection is a bad idea, I think it can be useful as a starting point of an analysis. In my case I have several hundred factors available, and I would like to pick the 5-10 most relevant. I don't see how I could do that without automatic model selection (which will later be manually amended). – S4M – 2012-01-10T09:33:50.327

All model selection procedures are subject to the problems that I discuss in my answer below. In addition, the larger the number of possible factors you want to search over, the more extreme those problems become, and the increase is not linear. While there are some better approaches (discussed by @Zach), which should be used in conjunction with cross-validation (discussed by @JackTanner), selecting based on t, r, and AIC are not among them. Moreover, with hundreds of factors the amount of data needed could easily be in the millions. Unfortunately, you have a very difficult task before you. – gung – 2012-01-10T16:21:00.320

What is the purpose of doing model selection? Is it for a predictive/forecasting model, or are you looking for the important variables? Also, how big is the data set you are using - how many observations and how many variables? – probabilityislogic – 2012-01-31T05:42:29.757

Interesting views here, but I think the negative view towards algorithmic model selection procedures is a bit dated. Take, for instance, the recent work by David Hendry in the field of econometrics, particularly his work on the PcGive software and saturation methods. A lecture providing an overview of his approach can be found here. As @MichaelChernick has pointed out (and Hendry would, too!), subject matter knowledge is (vastly) important. This is why there's value in subject specialists - to let the algorithms act alone is the mistake. – Graeme Walsh – 2016-11-29T04:01:16.650

I always say beware of automatic algorithms. It always helps to include subject matter knowledge. Stepwise procedures have problems. It would pay for you to read one of the many books available on model selection. – Michael Chernick – 2012-05-04T17:57:52.437