Feature selection is not that useful?



I've been doing a few data science competitions now, and I'm noticing something quite odd and frustrating. Why is it frustrating? Because, in theory, when you read about data science it's all about features: the careful selection, extraction and engineering of them to extract the maximum information out of raw variables. Yet so far, throwing every variable into the mix as-is seems to work fine with the right encodings. Even removing a variable that has 80% nulls (which in theory should be an overfitting contributor) slightly decreases the performance of the regression model.

For a practical case: I have long/lat for a pickup point and a destination point. I did the logical task of calculating the distance (all kinds of them) from these points, and dropped the long/lat. The model performs way better when you include both (coordinates & distance) in the feature list. Any explanations? And a general thought on my dilemma here about the real utility of feature selection/engineering/extraction.

EDIT: could it be that the information we can get out of the coordinates is bigger than the distance? Is it possible to extract features that are more beneficial to my model than plain long/lat?
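For reference, the kind of distance feature described above is typically computed with the haversine formula. A minimal stdlib-only sketch (the function name and the mean-radius constant are my own choices, not from the post):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    # haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Note that this collapses two coordinate pairs into a single scalar, which is one plausible reason dropping the raw long/lat loses information: the model can no longer learn location-specific effects (e.g. which neighbourhood a pickup is in), only trip length.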


Posted 2019-09-18T14:13:58.033

Reputation: 1 704

This is very hard to answer because it is so general. The case for feature engineering is trivial: let's say I have just a data frame consisting of the label to be predicted and a sentence coded as one string. Most if not all models won't even work without feature engineering in this case. Maybe your data sets have all been neatly pre-processed, or any of a thousand other possible cases. In your example you say yourself that the model performs better with more, engineered features than with fewer (unless coordinates alone perform better than coordinates plus distance). – Fnguyen – 2019-09-18T14:23:33.037



My experience is the same. I think in my case at least it's largely down to the algorithms I would generally use, all of which have the capacity to ignore features or down-weight them to insignificance where they're not particularly useful to the model. For example, a Random Forest will simply not select particular features to split against. A neural network will just weight features into having no effect on the output and so on. My experience is that algorithms which take every feature into account (like a vanilla linear regression model) generally suffer far more.
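The point about a Random Forest simply not selecting useless features can be illustrated with a toy single-split regression stump, fit greedily the way individual tree nodes are. The synthetic data and names below are illustrative, not the poster's actual setup:

```python
import random

def best_stump(X, y):
    """Greedy one-split regression stump: returns (feature index, threshold, squared error)."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            # squared error against each side's mean prediction
            err = sum((yi - sum(left) / len(left)) ** 2 for yi in left) \
                + sum((yi - sum(right) / len(right)) ** 2 for yi in right)
            if err < best[2]:
                best = (j, t, err)
    return best

random.seed(0)
# feature 0 drives the target; feature 1 is pure noise
X = [[random.random(), random.random()] for _ in range(200)]
y = [1.0 if row[0] > 0.5 else 0.0 for row in X]
```

Running `best_stump(X, y)` on this data always splits on feature 0: the noise column is never chosen, so leaving it in the feature list costs essentially nothing, which is the behaviour described above.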

Additionally, in a "production" rather than competitive environment I found that feature selection became much more important. This is generally due to covariate shift: the distribution of values for certain features changes over time and, where that change is significant between your training dataset and the live predictions you're making day-to-day, it can completely trash your model's outputs. This kind of problem seems to be scrubbed out of the datasets used for competitions, so I never experienced it until I started using ML at work.
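One common way to monitor for this kind of covariate shift is to compare each feature's training-time distribution against the live stream, for example with a two-sample Kolmogorov–Smirnov statistic. A stdlib-only sketch (the variable names and the drifted data are illustrative assumptions):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# illustrative only: a feature whose live values have drifted upward
train_values = list(range(100))
live_values = [x + 50 for x in range(100)]
drift = ks_statistic(train_values, live_values)  # a large value flags the feature for review
```

In practice you would compute this per feature on a schedule and alert when the statistic crosses a threshold you choose empirically; `scipy.stats.ks_2samp` does the same computation with a p-value attached.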

Dan Scally


Reputation: 1 574

True, I have live models in production. They have only been in production for a month, so I haven't yet seen degradation from the distribution of some features changing, but from a logical standpoint it makes sense that they would degrade over time. The frustration stems from the fact that after I find some cool insights during the EDA phase, I engineer something accordingly, and it almost always performs worse than the original feature, unless my feature engineering is incorrect or lacking something. Hence my question for experienced data scientists who might have encountered this existential dilemma. – Blenz – 2019-09-18T15:23:15.727

@Blenz I just started competitions. I'm working on the Titanic right now, and ran into this exact frustration. I think I'm being clever, but most of what I engineer makes the model worse. I'm trying to figure out if (a) I'm really not that clever or (b) it just doesn't matter. – rocksNwaves – 2020-03-27T19:01:06.753

@rocksNwaves, some months later and I still do not have an answer to this question. I ran into the same issue as you when I engineered something really intuitive and it worsened the model. Perhaps our interpretation of a feature's intuition is flawed. I'm still not sure to this day, unless a Kaggle grandmaster comes into this thread and gives a convincing answer. – Blenz – 2020-03-29T14:05:25.637


If you want to perform linear regression with feature selection, you can formulate the problem as a mixed-integer optimization (MIO) problem and solve it to optimality.

Then you can check whether it's worth it to do the feature selection.
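The MIO formulation is usually handed to a commercial solver, but for a small number of features the same optimum can be recovered by exhaustive enumeration over subsets. A stdlib-only sketch of that exact best-subset search (all names and the synthetic data are illustrative, not part of the answer above):

```python
from itertools import combinations

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (with intercept), via normal equations."""
    n = len(X)
    Xd = [[1.0] + list(row) for row in X]  # prepend an intercept column
    p = len(Xd[0])
    # normal equations A beta = b
    A = [[sum(Xd[i][r] * Xd[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    b = [sum(Xd[i][r] * y[i] for i in range(n)) for r in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    preds = [sum(br * xr for br, xr in zip(beta, row)) for row in Xd]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

def best_subset(X, y, k):
    """Exhaustively find the size-k feature subset minimising RSS (exact, but exponential in p)."""
    p = len(X[0])
    return min(combinations(range(p), k),
               key=lambda S: ols_rss([[row[j] for j in S] for row in X], y))
```

Comparing the RSS of the optimal subset against the full feature set tells you directly how much (or how little) the selection buys you, which is the check suggested above. For realistic feature counts the enumeration explodes and a proper MIO solver (e.g. Gurobi) is needed.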

Graph4Me Consultant


Reputation: 893