## Why does removal of some features improve the performance of random forests on some occasions?


I completed feature importance of a random forest model. I removed the bottom 4 features out of 17 features. The model performance actually improved. Shouldn't the performance degrade after removal of some features, given that some data has been lost? What are some reasons to explain the performance improvement?
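The question doesn't include code, but the workflow described might look like this minimal sketch using scikit-learn on synthetic data; the dataset, model settings, and the `feature_importances_`-based ranking are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 17-feature dataset described above
X, y = make_classification(n_samples=1000, n_features=17, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=150, random_state=0)
rf.fit(X_train, y_train)
baseline = accuracy_score(y_test, rf.predict(X_test))

# Drop the 4 least important features and refit on the remaining 13
keep = np.argsort(rf.feature_importances_)[4:]
rf2 = RandomForestClassifier(n_estimators=150, random_state=0)
rf2.fit(X_train[:, keep], y_train)
reduced = accuracy_score(y_test, rf2.predict(X_test[:, keep]))

print(f"all 17 features: {baseline:.3f}, top 13 features: {reduced:.3f}")
```

Whether the reduced model beats the baseline depends on the data; on some draws it will, which is exactly the behaviour being asked about.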

## Answers


A basic decision tree is pruned to reduce the likelihood of over-fitting the data and so help it generalise. Random forests don't usually require pruning because each tree is trained on a bootstrap sample of the data and considers only a random subset of the features at each split; when the trees are combined, the correlation between them is low, which reduces the risk of over-fitting and of building dependencies between trees.

There could be a few reasons why you get this unexpected improved performance, mostly depending on how you trained the random forest. If you did any of the following, you potentially allowed overfitting to creep in:

• a small number of trees was used
• very deep ("high-strength") trees were used, which learn idiosyncrasies of the training set
• your features were correlated with each other

By removing features, you have allowed your model to generalise slightly better and so improve its performance.

It might be a good idea to remove any features that are highly correlated, e.g. if two features have a pairwise correlation above 0.5, simply remove one of them. This is essentially what you did (removing 4 features), but in a more selective manner.
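That correlation filter can be sketched as follows; the 0.5 threshold matches the suggestion above, but the pandas-based implementation and toy data are illustrative:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.5):
    """Drop one feature from every pair whose absolute pairwise
    correlation exceeds `threshold` (keeps the first of each pair)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: x2 is a noisy copy of x1, so it gets dropped
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + 0.1 * rng.normal(size=200),
                   "x3": rng.normal(size=200)})
print(drop_correlated(df).columns.tolist())  # ['x1', 'x3']
```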

## Overfitting in Random Forests

You can read a bit more about the reasons above on Wikipedia, or in papers about random forests that discuss these issues:

1. The original random forests paper by Leo Breiman, which states in its conclusion:

> Because of the Law of Large Numbers, they do not overfit.

but also mentions the requirement of appropriate levels of randomness.

2. Elements of Statistical Learning by Hastie et al. (specifically section 15.3.4, Random Forests and Overfitting) gives more insight, commenting on what happens as the number of samples drawn from your training set grows:

> at the limit, the average of fully grown trees can result in too rich a model, and incur unnecessary variance

So there is perhaps a trade-off between the number of features you have, the number of trees used, and their depth. Work has been done to control the depth of trees, with some success - I refer you to Hastie et al. for more details and references.

The book includes a figure showing the results of a regression experiment in which tree depth is controlled via the minimum node size: requiring larger terminal nodes effectively stops the trees from being grown too deep, thereby reducing overfitting.
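In scikit-learn, the analogue of that minimum node size is the `min_samples_leaf` parameter; a sketch of the same experiment on synthetic regression data (the dataset and the parameter grid are made up for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger leaves -> shallower trees -> less variance (but more bias)
for leaf in (1, 5, 20, 50):
    rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=leaf,
                               random_state=0)
    rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test))
    print(f"min_samples_leaf={leaf:2d}  test MSE={mse:.1f}")
```

Which setting wins depends on the noise level of the data, which is the bias-variance trade-off the book describes.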

As a side note, section 15.3.2 addresses variable importance, which might interest you.

I assume that you trained ("grew") your random forest on some training data and tested the performance on some hold-out data, so the performance you speak of is valid.

Is 150 trees considered small or large? – user781486 – 2019-10-19T13:32:32.433

It depends on the size of your dataset. There are some approximations to the amount of variance you can expect in your model (or reduction, due to random trees compared to a single decision tree), depending on the sample size. I'd suggest trying different percentages of your sample set. E.g. for n=1000, 10% would mean 100 trees. Try something like 1, 10, 20, 50, 75 and 100% and compare results. It will also depend on the variance of your dataset, which is why there isn't a simple answer. – n1k31t4 – 2019-10-19T15:27:23.937
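The sweep suggested in that comment can be sketched as a simple loop over forest sizes; the dataset and the tree counts below are illustrative, not from the original discussion:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset; compare cross-validated accuracy
# for several forest sizes
X, y = make_classification(n_samples=1000, n_features=17, n_informative=8,
                           random_state=0)
for n_trees in (10, 50, 100, 150, 300):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{n_trees:3d} trees: mean 5-fold CV accuracy {score:.3f}")
```

Accuracy typically plateaus once the forest is "large enough" for the dataset, which is one way to judge whether 150 trees is small or large for your problem.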