## Algorithms for automatic model selection


I would like to implement an algorithm for automatic model selection. I am thinking of doing stepwise regression, but anything will do (it has to be based on linear regressions, though).

My problem is that I am unable to find a methodology, or an open-source implementation (I am working in Java). The methodology I have in mind would be something like:

1. calculate the correlation matrix of all the factors
2. pick the factors that have a low correlation to each other
3. remove the factors that have a low t-stat
4. add other factors (still based on the low correlation factor found in 2.).
5. reiterate several times until some criterion (e.g. AIC) passes a certain threshold or we can't improve it any further.
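Steps 1 and 2 can be sketched in a few lines. The snippet below (Python for illustration only, since no implementation language was fixed; the factor values and the 0.5 cutoff are made-up assumptions) computes pairwise Pearson correlations and greedily keeps only factors that are weakly correlated with every factor already kept:

```python
# Sketch of steps 1-2: correlation matrix + greedy low-correlation filter.
# Factor data and the 0.5 threshold are illustrative assumptions.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def low_correlation_subset(factors, threshold=0.5):
    """Greedily keep factors whose |r| with every kept factor is < threshold."""
    kept = []
    for name, values in factors.items():
        if all(abs(pearson(values, factors[k])) < threshold for k in kept):
            kept.append(name)
    return kept

factors = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.2, 5.9, 8.1, 9.9],   # nearly collinear with f1, so dropped
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to f1, so kept
}
print(low_correlation_subset(factors))
```

The greedy order matters: a different iteration order over the factors can keep a different (equally valid) low-correlation subset.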

I realize there is an R implementation of this (stepAIC), but I find the code quite hard to understand. Also, I have not been able to find articles describing stepwise regression.

What about the bootStepAIC package in R? – None – 2015-04-22T19:02:31.907

Frankly, I think this is a disastrous idea, just about guaranteed to lead to many false conclusions. – gung – 2012-01-09T18:30:54.630

@gung: while I agree that blindly following the result of a model selection is a bad idea, I think it can be useful as a starting point of an analysis. In my case I have several hundred factors available, and I would like to pick the 5-10 most relevant. I don't see how I could do that without automatic model selection (which will later be manually amended). – S4M – 2012-01-10T09:33:50.327

All model selection procedures are subject to the problems that I discuss in my answer below. In addition, the larger the number of possible factors you want to search over, the more extreme those problems become, and the increase is not linear. While there are some better approaches (discussed by @Zach), which should be used in conjunction with cross-validation (discussed by @JackTanner), selecting based on t, r, and AIC is not among them. Moreover, with hundreds of factors the amount of data needed could easily be in the millions. Unfortunately, you have a very difficult task before you. – gung – 2012-01-10T16:21:00.320

What is the purpose of doing model selection? Is it for a predictive/forecasting model, or are you looking for the important variables? Also, how big is the data set you are using - how many observations and how many variables? – probabilityislogic – 2012-01-31T05:42:29.757


Interesting views here, but I think the negative view towards algorithmic model selection procedures is a bit dated. Take, for instance, the recent work by David Hendry in the field of econometrics, particularly his work on the PcGive software and saturation methods. A lecture providing an overview of his approach can be found here. As @MichaelChernick has pointed out (and Hendry would, too!), subject-matter knowledge is vastly important. This is why there is value in subject specialists; letting the algorithms act alone is the mistake.

– Graeme Walsh – 2016-11-29T04:01:16.650

I always say beware of automatic algorithms. It always helps to include subject-matter knowledge. Stepwise procedures have problems. It would pay you to read one of the many books available on model selection. – Michael Chernick – 2012-05-04T17:57:52.437

### Answer (score -43)

I see my question generated lots of interest and an interesting debate about the validity of the automatic model selection approach. While I agree that taking the result of an automatic selection for granted is risky, it can be used as a starting point. So here is how I implemented it for my particular problem, which is to find the best n factors to explain a given variable:

1. run all the single-factor regressions of the variable against each individual factor
2. sort the regressions by a given criterion (say AIC)
3. remove the factors that have a low t-stat: they are useless in explaining our variable
4. with the order given in 2., try to add the factors one by one to the model, and keep them when they improve our criterion. Iterate over all the factors.

Again, this is very rough, there may be ways to improve the methodology, but that is my starting point. I am posting this answer hoping it can be useful for someone else. Comments are welcome!
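The four steps above can be sketched as follows (Python rather than Java, with made-up toy data; step 3's t-stat filter is approximated here by keeping a factor only when it improves the criterion, which is an assumption of this sketch, not the exact procedure described):

```python
# Rough sketch: rank single-factor regressions by AIC, then greedily add
# factors while the AIC improves. Pure-Python OLS via normal equations.
from math import log

def ols_rss(y, cols):
    """Residual sum of squares of y ~ intercept + cols (normal equations)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    A = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    for j in range(p):                      # Gaussian elimination with pivoting
        piv = max(range(j, p), key=lambda i: abs(A[i][j]))
        A[j], A[piv], b[j], b[piv] = A[piv], A[j], b[piv], b[j]
        for r in range(j + 1, p):
            f = A[r][j] / A[j][j]
            A[r] = [A[r][k] - f * A[j][k] for k in range(p)]
            b[r] -= f * b[j]
    beta = [0.0] * p
    for j in reversed(range(p)):
        beta[j] = (b[j] - sum(A[j][k] * beta[k] for k in range(j + 1, p))) / A[j][j]
    return sum((yi - sum(xi * bi for xi, bi in zip(r, beta))) ** 2
               for r, yi in zip(X, y))

def aic(y, cols):
    return len(y) * log(ols_rss(y, cols) / len(y)) + 2 * (len(cols) + 1)

def forward_select(y, factors):
    order = sorted(factors, key=lambda f: aic(y, [factors[f]]))  # steps 1-2
    chosen, best = [], aic(y, [])           # start from the intercept-only model
    for f in order:                         # step 4: add factors one by one
        score = aic(y, [factors[g] for g in chosen] + [factors[f]])
        if score < best:                    # keep only if the criterion improves
            chosen.append(f)
            best = score
    return chosen

factors = {
    "f1": [0, 1, 2, 3, 4, 5, 6, 7],
    "f2": [3, 1, 4, 1, 5, 9, 2, 6],         # pure noise
    "f3": [1, 0, 1, 0, 1, 0, 1, 0],
}
y = [3.1, 1.9, 7.2, 6.1, 11.0, 9.8, 15.1, 14.2]   # roughly 2*f1 + 3*f3
print(forward_select(y, factors))
```

As the comments below point out, the AIC values driving each step are themselves noisy, so this should be treated as a starting point, not as the truth.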

You did actually accept your own answer, which moves against every argument brought forward by the community. Not surprising to see the negatives here... – jank – 2013-11-21T11:56:39.860

I believe it's the first time I've seen so many downvotes. @S4M why don't you just accept some of the other excellent answers and delete your "answer"? – marbel – 2014-03-16T20:00:55.087


(1) I haven't witnessed any "debate" in this thread: all replies and comments present the same basic message. (2) Your implementation appears to be an ad hoc version of stepwise regression. I agree that it can be useful as a starting point provided it is not automatically accepted as an end in itself.

– whuber – 2012-02-11T16:25:22.243

Is this SE's most downvoted answer ever? – mkt – 2017-12-04T15:44:25.507

### Answer (score 271)

I think this approach is mistaken, but perhaps it will be more helpful if I explain why. Wanting to know the best model given some information about a large number of variables is quite understandable. Moreover, it is a situation in which people seem to find themselves regularly. In addition, many textbooks (and courses) on regression cover stepwise selection methods, which implies that they must be legitimate. Unfortunately, however, they are not, and this pairing of situation and goal is quite difficult to navigate successfully. The following is a list of problems with automated stepwise model selection procedures (attributed to Frank Harrell, and copied from here):

1. It yields R-squared values that are badly biased to be high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow; see Altman and Andersen (1989).
4. It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani [1996]).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
8. Increasing the sample size does not help very much; see Derksen and Keselman (1992).
9. It allows us to not think about the problem.
10. It uses a lot of paper.

The question is, what's so bad about these procedures / why do these problems occur? Most people who have taken a basic regression course are familiar with the concept of regression to the mean, so this is what I use to explain these issues. (Although this may seem off-topic at first, bear with me, I promise it's relevant.)

Imagine a high school track coach on the first day of tryouts. Thirty kids show up. These kids have some underlying level of intrinsic ability to which neither the coach, nor anyone else, has direct access. As a result, the coach does the only thing he can do, which is have them all run a 100m dash. The times are presumably a measure of their intrinsic ability and are taken as such. However, they are probabilistic; some proportion of how well someone does is based on their actual ability and some proportion is random. Imagine that the true situation is the following:

set.seed(59)  # for reproducibility
intrinsic_ability = runif(30, min=9, max=10)  # true ability, unobservable
time = 31 - 2*intrinsic_ability + rnorm(30, mean=0, sd=.5)  # observed 100m time

The results of the first race are displayed in the following figure along with the coach's comments to the kids.

Note that partitioning the kids by their race times leaves overlaps in their intrinsic ability; this fact is crucial. After praising some, and yelling at some others (as coaches tend to do), he has them run again. Here are the results of the second race with the coach's reactions (simulated from the same model above):

Notice that their intrinsic ability is identical, but the times bounced around relative to the first race. From the coach's point of view, those he yelled at tended to improve, and those he praised tended to do worse (I adapted this concrete example from the Kahneman quote listed on the wiki page), although actually regression to the mean is a simple mathematical consequence of the fact that the coach is selecting athletes for the team based on a measurement that is partly random.
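For readers who want to experiment, here is a rough Python rendering of the R simulation above (an assumed translation; the seed and constants mirror the R snippet, though the random streams differ). It selects the ten fastest kids from race 1, the ones the coach praises, and compares their mean times across the two races:

```python
# Regression to the mean: the group selected for being fastest in race 1
# tends to be slower (closer to the mean) in race 2.
import random

random.seed(59)  # mirrors set.seed(59) in the R snippet
ability = [random.uniform(9, 10) for _ in range(30)]   # intrinsic, unobserved

def race(ability):
    # observed time = deterministic part + fresh random noise each race
    return [31 - 2 * a + random.gauss(0, 0.5) for a in ability]

race1, race2 = race(ability), race(ability)

# the 10 fastest kids in race 1 -- the ones the coach praises
fastest = sorted(range(30), key=lambda i: race1[i])[:10]
mean1 = sum(race1[i] for i in fastest) / 10
mean2 = sum(race2[i] for i in fastest) / 10
print(f"praised group, race 1 mean: {mean1:.2f}s, race 2 mean: {mean2:.2f}s")
```

Part of the praised group's race-1 speed was luck, and the luck does not carry over, so on average their race-2 times drift back toward the mean.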

Now, what does this have to do with automated (e.g., stepwise) model selection techniques? Developing and confirming a model based on the same dataset is sometimes called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores (e.g., higher t-statistics), these are random variables and the realized values contain error. Thus, when you select variables based on having higher (or lower) realized values, they may be such because of their underlying true value, error, or both. If you proceed in this manner, you will be as surprised as the coach was after the second race. This is true whether you select variables based on having high t-statistics, or low intercorrelations. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else). Unfortunately, this is just a problem intrinsic to the epistemic nature of reality itself.
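The point that the AIC is itself a random variable can be checked directly: fit the same, correctly specified model to repeated draws from the same data-generating process and the AIC bounces around. A small Python sketch (all constants are arbitrary illustrative choices):

```python
# AIC of the *same* model refit on repeated draws from the *same* process.
import random
from math import log

def simple_ols_aic(x, y):
    """AIC = n*ln(RSS/n) + 2k for a simple linear regression y ~ x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - (my + slope * (a - mx))) ** 2 for a, b in zip(x, y))
    return n * log(rss / n) + 2 * 2   # k = 2 parameters: intercept and slope

random.seed(0)
x = [i / 10 for i in range(30)]
aics = []
for _ in range(5):
    y = [1 + 2 * a + random.gauss(0, 1) for a in x]  # same true model each time
    aics.append(simple_ols_aic(x, y))
print([round(a, 1) for a in aics])   # five different AICs for one true model
```

Any selection rule that compares these noisy values will sometimes pick a model because of the noise, which is exactly the coach's mistake.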

Excellent excellent explanation! Thanks. – Mayou – 2013-08-23T17:47:59.903

For an illustration of why AIC does not solve the problem, see: Mundry, R. (2011). Issues in information theory-based statistical inference—a commentary from a frequentist's perspective. Behavioral Ecology and Sociobiology, 65(1), 57-68. – jank – 2013-11-21T11:52:45.247

So what does leaving out a predictor mean? Cutting a predictor usually tends to increase the significance of the others. What does it mean when a predictor that was previously far from significant becomes significant after removing a non-significant one? It means the second was hiding the first. I'm pretty confused on the topic, and I love your examples! Would you mind taking a look at this other question of mine? http://stats.stackexchange.com/questions/102834/to-step-or-not-to-step-for-model-selection-in-regression

– Bakaburg – 2014-06-10T16:09:05.017

Among other things, if you drop a 2nd variable & the 1st becomes significant, it means that the p-value for the 1st is now invalid due to data dredging. I'm glad this helped you. I can take a look at your Q later; if appropriate, you might edit it to reflect what you've learned here & what you still need to know, so that the other Q wouldn't be a duplicate. – gung – 2014-06-10T16:19:24.040

I'd like to find a good, clear question for this answer: when is it appropriate to leave out a predictor? What I don't understand is that if stepping out a variable messes up the model, then every model is messed up and prone to type I error with respect to the infinite variables never taken into account to begin with! On the other side, to include more and more variables you need more and more observations to get significant results and avoid other problems, for example in logistic GLMs. So the only good model is one fit on the whole population with all possible variables taken into account! I'm clearly confused – Bakaburg – 2014-06-10T16:21:40.133

Please edit that information (if appropriate) into your other Q, not as a comment here. In addition, if this answer helped you, you might consider upvoting it. Unfortunately, I don't have time to address this right now, but that is a reasonable question; I'll try to put something together for you later. – gung – 2014-06-10T16:33:50.090

Phenomenal explanation of data dredging. – Frank Harrell – 2012-01-10T13:45:32.643

+1 - I look forward to linking to this answer in the future. – David Robinson – 2012-01-11T06:36:32.553

This is a very well-thought-out answer, although I completely disagree with the idea that AIC is an improvement over p-values (or BIC or similar) in the context of linear model selection. Any penalty like AIC which is of the form $-2L+kp$ is equivalent to setting the p-value to $Pr(\chi^2_1>k)$ (both entry and exit). AIC basically tells you how to choose the p-value. – probabilityislogic – 2012-01-31T06:00:31.643

@probabilityislogic, thanks for your comment! I'm afraid I don't completely follow you, though. 1st, you disagree AIC>(p or BIC), or (AIC or BIC)>p? 2nd, what I was thinking of was selecting the model-as-a-whole based on AIC is better than selecting individual covariates to include based on their individual p-values calculated by $\beta$/SE (i.e., equivalent to type III SS) <.05 (that is, typical software output), as this method is the one people typically use. Would you still disagree? – gung – 2012-01-31T07:38:35.620

I think it has 2 advantages: it deals w/ the model-as-a-whole, & the equivalence to $P(\chi^2>k)$ is not to a static arbitrary value: the p that minimizes AIC in 1 model isn't necessarily the same p in another model. In other words, there's a rationale for that p in each case that is better founded than just <.05 = good!, >.05 = bad! – gung – 2012-01-31T07:39:09.597

My comment was in regard to using AIC for stepwise or similar algorithms. My comment was also too brief. Note $p$ is the number of variables, $k$ is the penalty ($2$ for AIC, $\log N$ for BIC), and $-2L$ is negative twice the maximised log likelihood. AIC and BIC are conceptually, but not operationally, different from p-values when doing "subset"-style selection with no shrinkage of the non-zero coefficients. – probabilityislogic – 2012-01-31T09:29:29.353

@probabilityislogic, that helps to clarify, and I agree with all that. I didn't mean to suggest that stepwise selection based on AIC is good, only that it's less bad than using p-values. I understand what you mean by saying that AIC helps you choose a p-value, but isn't that different (and slightly less bad) than having an arbitrary, fixed $\alpha$ that doesn't take model complexity into account? Are you suggesting there is no difference whatsoever? – gung – 2012-01-31T16:38:33.120

@gung - if you take the difference between two models with one parameter different you get $(-2L_1+2p_0+2)-(-2L_0+2p_0)=-2(L_1-L_0)+2$. Now the first term is the likelihood ratio statistic upon which the p-value is based. So we are adding the extra parameter if the likelihood ratio statistic is bigger than some cutoff. This is the same as what the p-value approach is doing. There is only a conceptual difference here. – probabilityislogic – 2012-02-01T01:11:48.690

While I understand that the lasso shrinks (and is just soft thresholding in orthogonal case), why is it true that model search inflates coefficients? – user795305 – 2017-01-03T02:31:16.897

@Benjamin, you are simply picking the biggest coefficients (in absolute value). Realized values will bounce randomly around their true values; by picking only the biggest to keep, you are selecting the ones that are too high. – gung – 2017-01-03T02:35:07.560

@Gung, thanks for the quick reply!

Oh, I see! It's due to the nature of the stepwise procedure, and isn't a feature of model searching in general. Thanks! I misunderstood to think this might be an alternative reason (besides the bias-variance tradeoff) in favor of lasso shrinkage.

In fact, now, I don't see any connection between the lasso and these stepwise procedures. The lasso doesn't shrink the model selected by stepwise procedures; the selected variables may be different. (I know that it is related to stagewise via LARS, though.) Why is the lasso cited? – user795305 – 2017-01-03T02:42:37.837

@Benjamin, LASSO doesn't 'select' variables in the sense that stepwise selection does; it just shrinks estimates toward 0. At some value of lambda, some betas may actually be 0, but all betas will be shrunk in this manner. The selected variables may be similar, but their betas will not be. If you have a lambda a priori, that's it; alternatively, you could use cross-validation to search over possible values of lambda that optimize out-of-sample predictive accuracy. Neither approach is equivalent to stepwise selection. – gung – 2017-01-03T03:20:36.013

### Answer (score 59)

Check out the caret package in R. It will help you cross-validate step-wise regression models (use method='lmStepAIC' or method='glmStepAIC'), and might help you understand how these sorts of models tend to have poor predictive performance. Furthermore, you can use the findCorrelation function in caret to identify and eliminate collinear variables, and the rfe function in caret to eliminate variables with a low t-statistic (use rfeControl=rfeControl(functions=lmFuncs)).

However, as mentioned in the previous answers, these methods of variable selection are likely to get you in trouble, particularly if you do them iteratively. Make absolutely certain you evaluate your performance on a COMPLETELY held-out test set. Don't even look at the test set until you are happy with your algorithm!

Finally, it might be better (and simpler) to use a predictive model with "built-in" feature selection, such as ridge regression, the lasso, or the elastic net. Specifically, try the method=glmnet argument for caret, and compare the cross-validated accuracy of that model to the method=lmStepAIC argument. My guess is that the former will give you much higher out-of-sample accuracy, and you don't have to worry about implementing and validating your custom variable selection algorithm.

Penalties like the double Pareto are better than ridge and lasso from a statistical perspective, as they don't shrink the clearly non-zero coefficients. But unfortunately, they always lead to a non-convex penalty, so they are worse from a computational perspective. I would think a penalty based on the Cauchy distribution would be good: $\log(\lambda^2+\beta^2)$. – probabilityislogic – 2013-06-30T00:02:41.457

@probabilityislogic Do you know of any good implementations of the double Pareto penalty, in a language like R or Python? I'd love to try it out. – Zach – 2013-06-30T01:29:41.427

From what I understand, model selection by AIC and leave-one-out-cross-validation is essentially the same thing (asymptotic equivalence, see Stone, 1977), so AIC and some types of cross-validation are likely to lead to very similar results. However, I haven't used the caret package, and from the method calls it seems like AIC is indeed used in some cases.

– fileunderwater – 2015-01-26T09:08:10.620

### Answer (score 31)

I fully concur with the problems outlined by @gung. That said, realistically speaking, model selection is a real problem in need of a real solution. Here's something I would use in practice.

1. Split your data into training, validation, and test sets.
2. Train models on your training set.
3. Measure model performance on the validation set using a metric such as prediction RMSE, and choose the model with the lowest prediction error.
4. Devise new models as necessary, repeat steps 2-3.
5. Report how well the model performs on the test set.
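The five steps above can be sketched as follows (Python, with toy data and two hypothetical candidate models; everything here is an illustrative assumption):

```python
# Train/validation/test model selection: pick the candidate with the lowest
# validation RMSE, then touch the test set once for the final report.
import random
from math import sqrt

random.seed(1)

def make_row():
    x = random.uniform(0, 10)
    return (x, 3 * x + random.gauss(0, 1))   # assumed true model: y = 3x + noise

data = [make_row() for _ in range(300)]
train, valid, test = data[:180], data[180:240], data[240:]   # step 1

def fit_mean(rows):
    """Candidate 1: intercept-only model (predicts the training mean)."""
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m

def fit_line(rows):
    """Candidate 2: simple linear regression fit by least squares."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return lambda x: my + b * (x - mx)

def rmse(model, rows):
    return sqrt(sum((y - model(x)) ** 2 for x, y in rows) / len(rows))

candidates = {"mean": fit_mean(train), "line": fit_line(train)}        # step 2
best = min(candidates, key=lambda name: rmse(candidates[name], valid)) # step 3
print(best, round(rmse(candidates[best], test), 2))                    # step 5
```

The key discipline is in step 5: the test RMSE is computed exactly once, after the winner has already been chosen on the validation set.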

For an example of the use of this method in the real world, I believe that it was used in the Netflix Prize competition.

Data splitting is not reliable unless $n>20000$. – Frank Harrell – 2012-01-10T13:46:41.363

@Frank: Why do you think N needs to be so high? – rolando2 – 2012-01-11T02:29:26.837

Because of poor precision. If you split again you can get much different results. That's why people do 100 repeats of 10-fold cross-validation, or bootstrapping. – Frank Harrell – 2012-01-11T03:07:52.550

@FrankHarrell What does that n>20000 figure depend on? Is it based on the original poster's comment about having "several hundreds of factors"? Or is it independent of any aspect of the data? – Darren Cook – 2012-01-13T00:51:25.183


The kind of setting that I tested data splitting on was n=17000 with a fraction of 0.3 having an event, and about 50 parameters examined or fitted in a binary logistic model. I used a random 1:1 split. The validated ROC area in the test sample changed substantially when I re-split the data and started over. Look under Studies of Methods Used in the Text at http://biostat.mc.vanderbilt.edu/rms for simulation studies and related papers giving more information.

– Frank Harrell – 2012-01-13T13:41:55.640

This info is much appreciated. – rolando2 – 2012-01-29T16:23:19.913

Do you have some more details about the Netflix competition, and how the solution was using this method? – danieldekay – 2016-07-03T09:32:34.983

### Answer (score 12)

To answer the question, there are several options:

1. all-subsets by AIC/BIC
2. stepwise by p-value
3. stepwise by AIC/BIC
4. regularisation such as LASSO (can be based on either AIC/BIC or CV)
5. genetic algorithms (GA)
6. others?
7. non-automatic, theory-oriented ("subject matter knowledge") selection

The next question is which method is better. This paper (doi:10.1016/j.amc.2013.05.016) indicates that "all possible regression" gave the same results as their proposed new method, and that stepwise is worse. A simple GA falls between them. This paper (doi:10.1080/10618600.1998.10474784) compares penalized regression (bridge, lasso, etc.) with "leaps-and-bounds" (apparently an exhaustive search algorithm, but quicker) and also found that "the bridge model agrees with the best model from the subset selection by the leaps and bounds method". This paper (doi:10.1186/1471-2105-15-88) shows GA is better than LASSO. This paper (doi:10.1198/jcgs.2009.06164) proposed a method that is essentially an all-subsets (based on BIC) approach but cleverly reduces the computation time; they demonstrate this method is better than LASSO. Interestingly, this paper (doi:10.1111/j.1461-0248.2009.01361.x) shows methods (1)-(3) produce similar performance.

So overall the results are mixed, but I get the impression that GA performs very well, although stepwise may not be too bad, and it is quick.

As for 7), the use of non-automatic, theory-oriented ("subject matter knowledge") selection: it is time-consuming and not necessarily better than automatic methods. In fact, in the time-series literature it is well established that automated methods (especially commercial software) outperform human experts "by a substantial margin" (doi:10.1016/S0169-2070(01)00119-4, p. 561; e.g. selecting among various exponential smoothing and ARIMA models).

Be aware that you can get differing performance in simulation studies of different selection algorithms by changing the data generating process to favor (even if not intentionally) a particular routine. The issue of what approach will be faster or slower is distinct, but potentially still important. – gung – 2015-03-17T16:30:10.450


In fact the examples in Tibshirani's original paper on LASSO illustrate @gung's point well. The same goes for comparative studies of different methods on real data. BTW, is your last reference right? The paper by Clements & Hendry with the DOI you give doesn't make the claim that automated methods out-perform human experts, or use the words "by a substantial margin" at all. (It'd be nice if you gave full references.)

– Scortchi – 2015-03-17T17:47:59.113

Found it: Goodrich (2001), "Commercial software in the M3-Competition", Int. J. Forecast., 17, pp 560–565. It's under the same "Commentaries on the M3-Competition" collection as the Clements & Hendry paper, which is why Googling the DOI sent me there. – Scortchi – 2015-03-17T18:21:31.517

Anyway, it's not at all clear that Goodrich's comment has anything to do with subject-matter expertise. – Scortchi – 2015-03-17T20:45:18.443

@Scortchi it may be clearer in the article starting from page 581 of that issue of the journal. It is more related to the "M2-Competition", in which automatic forecasting methods were compared with invited human experts (including some big names in the time-series literature) who knew the economic/industry context and could even ask for additional information from the companies that provided the data. – heran_xp – 2015-03-18T17:45:26.753

### Answer (score 4)

Here's an answer out of left field: instead of using linear regression, use a regression tree (rpart package). This is suitable for automatic model selection because, with a little work, you can automate the selection of cp, the parameter used to avoid over-fitting.

### Answer (score 2)

A linear model can be optimised by applying a genetic algorithm to choose the most valuable independent variables. The variables are represented as genes in the algorithm, and the best chromosome (set of genes) is then selected after crossover, mutation, and other operators. The approach is based on natural selection: the best 'generation' survives; in other words, the algorithm optimises an estimation function that depends on the particular model.
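A hedged sketch of this idea (Python; the data, operators, and all settings are illustrative assumptions, not a reference implementation): chromosomes are bit masks over candidate variables, and fitness is the AIC of the corresponding OLS fit.

```python
# Genetic algorithm for variable selection: evolve bit masks over factors,
# scoring each mask by the AIC of the OLS fit using the selected factors.
import random
from math import log

def ols_rss(y, cols):
    """RSS of y ~ intercept + cols, via normal equations (pure Python)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    A = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    for j in range(p):                         # Gaussian elimination
        piv = max(range(j, p), key=lambda i: abs(A[i][j]))
        A[j], A[piv], b[j], b[piv] = A[piv], A[j], b[piv], b[j]
        for r in range(j + 1, p):
            f = A[r][j] / A[j][j]
            A[r] = [A[r][k] - f * A[j][k] for k in range(p)]
            b[r] -= f * b[j]
    beta = [0.0] * p
    for j in reversed(range(p)):
        beta[j] = (b[j] - sum(A[j][k] * beta[k] for k in range(j + 1, p))) / A[j][j]
    return sum((yi - sum(xi * bi for xi, bi in zip(r, beta))) ** 2
               for r, yi in zip(X, y))

def aic(y, factors, mask):
    cols = [f for f, m in zip(factors, mask) if m]
    return len(y) * log(ols_rss(y, cols) / len(y)) + 2 * (len(cols) + 1)

def evolve(y, factors, pop_size=12, generations=20, p_mut=0.1):
    nf = len(factors)
    pop = [[random.randint(0, 1) for _ in range(nf)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: aic(y, factors, m))   # lower AIC = fitter
        parents = pop[: pop_size // 2]               # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, nf)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < p_mut) for g in child]  # mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda m: aic(y, factors, m))

random.seed(7)
n = 40
factors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]
# assumed true model: y depends on factors 0 and 2 only
y = [3 * factors[0][i] - 2 * factors[2][i] + random.gauss(0, 0.5) for i in range(n)]
best = evolve(y, factors)
print(best)   # the mask should switch on factors 0 and 2
```

Note that, as @gung's comment below points out, the fitness values are computed on the same data being searched over, so this inherits the data-dredging problems discussed in the other answers.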

That would select the "best" variables in the data, not necessarily the best variables in the data-generating process / population, as it only has access to the data. It isn't really that dissimilar from the stepwise method the OP wanted. – gung – 2015-03-14T16:59:55.687

### Answer (score -1)

We do have a function in the base R stats package, called step(), which does forward, backward, or stepwise selection of models based on lowest AIC. It also works for factor variables. Does this not serve the purpose here?