LASSO remaining features for different penalisation


I am using sklearn's `LassoCV` and changing the penalisation parameter to adjust the number of features killed off. For example, for $\alpha = 0.01$ I have 55 features remaining, and for $\alpha = 0.5$ I have 6 remaining features.

I would expect the 6 features I get in the second case to be a subset of the 55 features I get in the first case. This is not what happens, however: 4 of the 6 features are not among the 55 of the first case.

Could someone explain to me the intuition behind why this is happening?

Thanks, P


Posted 2019-08-24T16:59:47.660

Reputation: 41

It doesn't seem like this should happen; at least in the orthonormal input case, there's a closed-form solution to LASSO, with coefficients monotonic in the regularization parameter. Maybe collinearity makes it possible? It's also common to view the "regularization path," and I've never seen a path break away from zero. Are you able to share the data and code so we can play with it?

– Ben Reiniger – 2019-08-24T21:13:26.347

@BenReiniger I will view the regularisation path and comment if I find something of interest. Unfortunately I am not able to share the data at this point. – prax1telis – 2019-08-26T11:24:19.823



Lambda is a tuning parameter ("how much regularisation"; I think it is called alpha in sklearn), and you would choose lambda so that you optimise fit (e.g. by MSE). You can do this by running cross-validation.
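A minimal sketch of this in sklearn, using synthetic data (substitute your own `X`, `y`): `LassoCV` searches a grid of alphas and keeps the one with the best cross-validated error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem: 20 features, only 5 informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# 5-fold cross-validation over an automatic grid of alphas
model = LassoCV(cv=5, random_state=0).fit(X, y)
print("best alpha:", model.alpha_)
print("features kept:", np.count_nonzero(model.coef_))
```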

This page (for the GLMnet package in R) explains how to apply Lasso in a very instructive way (alpha is the elastic-net mixing parameter here, Lambda is the tuning parameter).

You may also look at "An Introduction to Statistical Learning", Ch. 6.2, in which the Lasso is discussed in a very good way.

There are also Python Labs for the book, which should give you a blueprint for how to use Lasso in Python (see Section 6.6.2).


Posted 2019-08-24T16:59:47.660

Reputation: 4 724


There is nothing that guarantees the six features will be a subset of the 55. Intuitively, if you parametrise your problem by the number of features instead of by lambda, it is easier to see how two variables together could explain the output better than a single different one.

Another problem with these regularisation methods comes from correlated variables. If you take two highly correlated explanatory variables that have a similar impact on the output, you can never be sure which one the regularisation process will keep. In the context of correlated variables it is even easier to see how, by almost arbitrarily keeping one variable instead of another, you can end up with some instability in the variables chosen.
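A small sketch of this effect, using a synthetic dataset where two features are near-duplicates of the same signal (the data and alpha values here are illustrative, not from the question): the set of features kept can differ between penalties, since the lasso picks more or less arbitrarily among correlated columns.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# x0 and x1 are near-duplicates of the same underlying signal z; x2 is pure noise
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n),
                     rng.normal(size=n)])
y = z + 0.1 * rng.normal(size=n)

for alpha in (0.01, 0.5):
    kept = np.flatnonzero(Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_)
    print(f"alpha={alpha}: features kept -> {kept}")
```

Which of the two duplicated columns survives at each alpha is essentially an accident of the data and the solver, which is the instability described above.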

Note that certain implementations (glmnet in R) proceed differently: they iterate through different values of $\lambda$ and remove variables as $\lambda$ increases, so that the features kept at higher penalties are a subset of those kept at lower penalties.
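In sklearn you can inspect the same thing with `lasso_path`, which computes the coefficients over a whole grid of alphas in one call (synthetic data again, for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# alphas come back in decreasing order; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
for a, c in zip(alphas, coefs.T):
    print(f"alpha={a:10.3f}  active features: {np.flatnonzero(c)}")
```

Printing the active set at each alpha makes it easy to spot a feature that enters the path at one penalty and drops out at another.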

On another topic, this may hint that you need to do some work on feature engineering, as some variables appear to capture redundant information. Good feature engineering will also be helpful for model explainability.


Posted 2019-08-24T16:59:47.660

Reputation: 1 874