Why would a fake feature with random numbers get selected in feature importance?



I'm using sklearn.ensemble.RandomForestClassifier(n_estimators=100) to work on this challenge: https://kaggle.com/c/two-sigma-financial-news

I've plotted my feature importance:

[feature importance plot]

I created a fake feature called random, which is just numbers drawn from np.random.randn(). Unfortunately, it seems to have quite significant feature importance.

How am I supposed to interpret this? I had expected it to be at the bottom.

PS xgboost seems to discard this feature, as it should.


Posted 2018-11-14T11:49:16.150




Scikit-learn's default random forest feature importance (feature_importances_) is the mean decrease in impurity (MDI): it measures how much each feature reduces node impurity across the splits where it is used, averaged over the trees. This is fast to compute, because it falls out of training itself, and it is faithful to how the Random Forest method was originally formulated. But it has a well-known drawback, which is exactly what you observed: MDI is computed on the training data, so a feature the trees happen to split on - even pure noise - can receive a nonzero, sometimes sizable, score. Continuous and high-cardinality features are especially prone to inflated MDI scores, and your random column is continuous, which helps it rank above genuinely uninformative features. So you may want to consider other feature importance methods, like permutation feature importance, which scores each feature by how much performance on held-out data drops when that feature's values are shuffled; this also gives you a more apples-to-apples comparison with the other models you will test. There are pros and cons to each of these methods, so be aware of them.
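Here is a minimal sketch of the difference, on synthetic data (the dataset, column count, and random seeds are assumptions for illustration, not your actual Kaggle setup): we append a pure-noise column, then compare its MDI score against its permutation importance on a held-out split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic classification data, plus one pure-noise column appended
# at the end (analogous to the fake "random" feature in the question).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X = np.hstack([X, rng.randn(X.shape[0], 1)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI is computed on training data, so the trees' splits on the noise
# column still earn it a nonzero score.
print("MDI importance of noise feature:", clf.feature_importances_[-1])

# Permutation importance is measured on held-out data: shuffling a
# noise column cannot hurt generalization, so its score collapses
# toward zero.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation importance of noise feature:", perm.importances_mean[-1])
```

Running this, the noise column typically gets a visibly nonzero MDI score but a permutation importance near zero, mirroring what you saw in your plot.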

