Dealing with correlated features when calculating permutation importance


I have implemented the permutation importance calculation as found here in an attempt to identify features that contribute little to the predictive power of my model (a gradient boosted tree model).

The issue I have encountered is that some of my features are highly correlated, which can mask their true importance when evaluated by permutation importance. The usual solution would be something like Recursive Feature Elimination, but I cannot do this because the cost of retraining the model is prohibitive: it takes ~3 hours to train on a feature set of 39 features.

My question is whether it is possible to use permutation importance while dealing with correlated features. My initial thought was to invert the process and shuffle all features other than the one I want to investigate, although I do not know whether this would be equally informative.
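One common workaround is to permute an entire cluster of correlated features together with the same row permutation, so no correlated twin can stand in for the permuted column. The sketch below uses toy data and a helper I wrote for illustration (`grouped_permutation_importance` is not a library function); the grouping itself would have to come from your own correlation analysis:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Toy data: append a noisy near-duplicate of feature 0,
# so columns 0 and 4 are almost perfectly correlated.
X, y = make_regression(n_samples=500, n_features=4, noise=1.0, random_state=0)
dup = X[:, [0]] + 0.01 * np.random.RandomState(1).randn(500, 1)
X = np.hstack([X, dup])

model = GradientBoostingRegressor(random_state=0).fit(X, y)

def grouped_permutation_importance(model, X, y, groups, n_repeats=10, seed=0):
    """Permute every column in a group with the same row permutation,
    so a cluster of correlated features is broken jointly instead of
    one column at a time."""
    rng = np.random.RandomState(seed)
    baseline = r2_score(y, model.predict(X))
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, cols] = X[perm][:, cols]
            drops.append(baseline - r2_score(y, model.predict(Xp)))
        importances[name] = float(np.mean(drops))
    return importances

# Treat the correlated pair (columns 0 and 4) as a single group.
groups = {"corr_pair": [0, 4], "f1": [1], "f2": [2], "f3": [3]}
imp = grouped_permutation_importance(model, X, y, groups)
```

Because no model retraining is involved, this costs the same as ordinary permutation importance; the result is an importance per group rather than per feature.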


Posted 2019-02-25T13:47:21.813

Reputation: 172



Using the correlation of each feature with the output is not very helpful. The reason is that the correlation coefficient can only detect linear dependence. Except in the Gaussian case, as far as I know, it cannot even tell you whether the variables are independent: if the coefficient is zero and you don't know the distribution, you cannot conclude independence, only that the variables are not linearly dependent. In real-world applications it rarely happens that features depend only linearly on the output. Consequently, approaches like this are of limited use: two features may jointly have a strong relationship with the output even though neither separates it well on its own.
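A minimal illustration of this point, using scikit-learn's `mutual_info_regression` on a purely quadratic relationship: the Pearson coefficient comes out near zero even though the dependence is perfect, while the mutual information estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 2000)
y = x ** 2  # deterministic but purely nonlinear dependence

# Pearson correlation misses the relationship entirely.
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information (kNN-based estimator) detects it.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
```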

When you have many features, you can consider other feature selection and extraction methods. For your case, I suspect a wrapper method may suit better. In a wrapper method you don't rely on a fixed criterion; instead you search through subsets of features, usually with heuristic-based methods, to find the subset that gives the best cross-validation accuracy.
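As a rough sketch of one such wrapper method, here is greedy forward selection with scikit-learn's `SequentialFeatureSelector` on toy data (the estimator, subset size, and data are illustrative assumptions; with an expensive model each candidate subset costs a full cross-validated retrain, which is exactly the cost concern raised in the question):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Toy data: 8 features, only 3 of which are informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# Greedy forward selection: at each step, add the feature whose
# inclusion yields the best cross-validated R^2.
selector = SequentialFeatureSelector(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    n_features_to_select=3,
    direction="forward",
    scoring="r2",
    cv=3,
)
selector.fit(X, y)
mask = selector.get_support()  # boolean mask of the selected features
```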


Posted 2019-02-25T13:47:21.813

Reputation: 12 077

I see what you mean; unfortunately, as I said before, running some feature selection is going to be costly and I would like to avoid it as much as possible. Would you have any suggestions as to how to avoid this? – HFulcher – 2019-02-26T15:49:51.690

I guess it's better not to use correlation at all. If you can't use a wrapper method, I guess the next best option is mutual information. It is costly, but the benefit is that it enables you to find relations other than linear dependencies. – Media – 2019-02-26T16:18:54.390