When should I use lasso vs ridge?

Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more appropriate? When should I use lasso?

Larry Wang

Posted 2010-07-28T01:10:18.423

Reputation: 796

You say "lasso vs ridge" as if they're the only two options - what about generalised double pareto, horseshoe, bma, bridge, among others? – probabilityislogic – 2014-04-12T01:16:59.337

A similar question has just been asked on metaoptimize (keeping in mind that l1 = LASSO and l2 = ridge): http://metaoptimize.com/qa/questions/5205/when-to-use-l1-regularization-and-when-l2

– Gael Varoquaux – 2011-03-20T20:13:05.083

"Say I want to estimate a large number of parameters" this could be made more precise: What is the framework ? I guess it is linear regression? – robin girard – 2010-07-28T05:59:31.147

Answers

Keep in mind that ridge regression can't zero out coefficients; thus, you either end up including all the coefficients in the model, or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net [3] instead of the LASSO.
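A minimal sketch of this difference (illustrative code, not from the answer; assumes scikit-learn and a simulated dataset where only the first three covariates matter):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
# Only the first three covariates truly matter.
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks every coefficient but leaves none exactly zero;
# the LASSO sets the irrelevant ones to exactly zero.
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```

The penalty strengths here are arbitrary choices for illustration; in practice you would select them by cross-validation or a criterion such as GCV.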

I'd personally recommend using the Non-Negative Garrote (NNG) [1] as it's consistent in terms of estimation and variable selection [2]. Unlike LASSO and ridge regression, NNG requires an initial estimate that is then shrunk towards the origin. In the original paper, Breiman recommends the least-squares solution for the initial estimate (you may however want to start the search from a ridge regression solution and use something like GCV to select the penalty parameter).

In terms of available software, I've implemented the original NNG in MATLAB (based on Breiman's original FORTRAN code). You can download it from:

http://www.emakalic.org/blog/wp-content/uploads/2010/04/nngarotte.zip

BTW, if you prefer a Bayesian solution, check out [4,5].

References:

[1] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373-384.

[2] Yuan, M. & Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69, 143-161.

[3] Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.

[4] Park, T. & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103, 681-686.

[5] Kyung, M., Gill, J., Ghosh, M. & Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5, 369-412.

emakalic

Posted 2010-07-28T01:10:18.423

Reputation: 1 395

Could you be more specific on ridge vs lasso? Is automatic variable selection the only reason to prefer lasso? – Chogg – 2017-10-03T17:43:08.160

Ridge and lasso are forms of regularized linear regression. The regularization can also be interpreted as a prior in a maximum a posteriori (MAP) estimation method. Under this interpretation, ridge and lasso make different assumptions about the class of linear transformations they infer to relate input and output data: in ridge, the coefficients of the linear transformation are normally distributed, while in lasso they are Laplace distributed. The Laplace prior makes it easier for coefficients to be exactly zero, and therefore easier to eliminate some of your input variables as not contributing to the output.
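As a sketch of that MAP interpretation (standard textbook results, stated here for concreteness): with a Gaussian likelihood $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$, a Gaussian prior on the coefficients yields ridge and a Laplace prior yields lasso.

```latex
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\, p(\beta \mid y)
  = \arg\min_{\beta}\, \bigl[ -\log p(y \mid \beta) - \log p(\beta) \bigr]

% Gaussian prior \beta_j \sim \mathcal{N}(0, \tau^2) gives ridge:
\hat{\beta}_{\mathrm{ridge}}
  = \arg\min_{\beta}\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,
  \qquad \lambda = \sigma^2 / \tau^2

% Laplace prior p(\beta_j) = \tfrac{1}{2b} e^{-|\beta_j|/b} gives lasso:
\hat{\beta}_{\mathrm{lasso}}
  = \arg\min_{\beta}\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
  \qquad \lambda = 2\sigma^2 / b
```

The penalty strength is thus tied to the ratio of the noise variance to the prior scale: a tighter prior (smaller $\tau$ or $b$) means heavier shrinkage.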

There are also some practical considerations. The ridge is a bit easier to implement and faster to compute, which may matter depending on the type of data you have.

If you have both implemented, use subsets of your data to find the ridge and the lasso and compare how well they work on the left out data. The errors should give you an idea of which to use.
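A minimal sketch of that comparison (illustrative code, not from the answer; assumes scikit-learn, whose RidgeCV and LassoCV also pick the penalty strength by cross-validation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
# Simulated truth: 5 real effects, 15 null covariates.
beta = np.concatenate([2.0 * rng.standard_normal(5), np.zeros(15)])
y = X @ beta + rng.standard_normal(200)

results = {}
for name, model in [("ridge", RidgeCV(alphas=np.logspace(-3, 3, 13))),
                    ("lasso", LassoCV(cv=5))]:
    # Held-out mean squared error over 5 cross-validation folds.
    results[name] = -cross_val_score(model, X, y, cv=5,
                                     scoring="neg_mean_squared_error").mean()
    print(f"{name}: held-out MSE = {results[name]:.2f}")
```

Whichever model gives the lower held-out error on your own data is the one to prefer; on a different dataset the ranking can flip.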

Hbar

Posted 2010-07-28T01:10:18.423

Reputation: 826

I don't get it - how would you know if your coefficients are Laplace or normally distributed? – ihadanny – 2015-09-20T21:15:20.717

Why is ridge regression faster to compute? – Archie – 2017-04-05T12:08:52.707

@Hbar: "The regularization can also be interpreted as prior in a maximum a posteriori estimation method.": could you please explain this part in more detail with mathematical symbols, or at least give a reference? Thanks! – Mathmath – 2017-09-17T17:19:36.127

@ihadanny You most probably wouldn't know, and that's the point. You can only decide which one to keep a posteriori. – Firebug – 2017-10-11T16:23:45.220

Generally, when you have many small/medium-sized effects, you should go with ridge. If you have only a few variables with a medium/large effect, go with lasso. (Hastie, Tibshirani & Friedman)
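A hypothetical simulation of this rule of thumb (illustrative code, not from the answer; assumes scikit-learn), comparing held-out error for a dense truth with many small effects against a sparse truth with a few large ones:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

def heldout_mse(beta, seed):
    """Fit ridge and lasso on simulated data; return held-out MSE of each."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((300, 50))
    y = X @ beta + rng.standard_normal(300)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=seed)
    out = {}
    for name, model in [("ridge", RidgeCV(alphas=np.logspace(-3, 3, 13))),
                        ("lasso", LassoCV(cv=5))]:
        model.fit(Xtr, ytr)
        out[name] = float(np.mean((model.predict(Xte) - yte) ** 2))
    return out

dense = np.full(50, 0.3)                                   # many small effects
sparse = np.concatenate([[4.0, -3.0, 2.0], np.zeros(47)])  # few large effects

print("dense truth :", heldout_mse(dense, 0))
print("sparse truth:", heldout_mse(sparse, 0))
```

On a typical run ridge tends to do better under the dense truth and lasso under the sparse one, though any single simulation can deviate.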

Gary

Posted 2010-07-28T01:10:18.423

Reputation: 808

But when you have a few variables, you may want to keep them all in your model if they have medium/large effects, which will not be the case with lasso, as it might remove one of them. Can you please explain this in detail? I feel that when you have many variables we use lasso to remove unnecessary variables, not ridge. – aditya bhandari – 2017-10-11T16:20:05.347