A basic decision tree is pruned to reduce the likelihood of over-fitting to the data and so help it generalise. Random forests don't usually require pruning because each tree is trained on a bootstrap sample of the data and considers only a random subset of the features at each split. When the trees are combined, there is little correlation between them, which reduces the risk of over-fitting.
There could be a few reasons why you get this unexpected improvement in performance, mostly depending on how you trained the random forest. If any of the following applies, you potentially allowed overfitting to creep in:
- a small number of random trees was used
- trees with high strength were used, meaning very deep trees that learn idiosyncrasies of the training set
- your features are correlated with each other
If so, by removing features you have allowed your model to generalise slightly better, which improved its performance.
It might be a good idea to remove any features that are highly correlated, e.g. if two features have a pairwise correlation of >0.5 (in absolute value), simply remove one of them. This would essentially be what you did (removing 3 features), but in a more selective manner.
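As a rough sketch of that filtering step, assuming your features live in a pandas DataFrame: the helper name `drop_correlated_features`, the toy columns and the 0.5 threshold are all just illustrative choices, not a recommendation. It is a greedy filter that drops a column whenever it correlates strongly with any column that appears before it.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df, threshold=0.5):
    """Greedily drop one feature from each highly correlated pair.

    Hypothetical helper: a column is dropped if its absolute Pearson
    correlation with any earlier column exceeds `threshold`.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is looked at once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data (made up): "a_noisy" is nearly a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.1 * rng.normal(size=200),
    "b": b,
})

reduced, dropped = drop_correlated_features(demo, threshold=0.5)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Which member of a correlated pair gets dropped depends only on column order here, so you may prefer to keep whichever one is more meaningful for your problem.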
Overfitting in Random Forests
You can read a bit more about the reasons above on Wikipedia or in some papers about random forests that discuss these issues:
Random forest paper by Leo Breiman - states in the conclusion section:
Because of the Law of Large Numbers, they do not overfit.
but also mentions the requirement of appropriate levels of randomness.
Elements of Statistical Learning by Hastie et al. (specifically section 15.3.4, Random Forests and Overfitting) gives more insight, referring to what happens as you increase the number of bootstrap samples drawn from your training set (i.e. the number of trees):
at the limit, the average of fully grown trees can result in too rich a model, and incur unnecessary variance
So there is perhaps a trade-off between the number of features you have, the number of trees used, and their depths. Work has been done to control the depth of trees, with some success. I refer you to Hastie et al. for more details and references.
Here is an image from the book, which shows the results of a regression experiment that controls the depth of trees via the minimum node size. Requiring larger nodes effectively restricts your decision trees from being grown too far, and therefore reduces overfitting.
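If you want to try something similar on your own data, here is a minimal sketch using scikit-learn. It assumes that `min_samples_leaf` is a reasonable stand-in for the book's minimum node size, and it uses a synthetic dataset from `make_regression`, so the numbers it prints say nothing about your problem. It is not a reproduction of the book's experiment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for min_leaf in (1, 5, 20, 50):
    # Larger min_samples_leaf means shallower trees: a leaf must contain
    # at least this many training samples.
    forest = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=min_leaf, random_state=0
    )
    forest.fit(X_train, y_train)
    train_r2 = forest.score(X_train, y_train)
    test_r2 = forest.score(X_test, y_test)  # R^2 on the hold-out data
    scores[min_leaf] = (train_r2, test_r2)
    print(f"min_samples_leaf={min_leaf:>2}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

A large gap between the training and hold-out scores for fully grown trees (`min_samples_leaf=1`) that narrows as the leaf size increases would be consistent with the overfitting described above, though whether the hold-out score actually improves depends on the data.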
As a side note, section 15.3.2 addresses variable importance, which might interest you.
I assume that you trained ("grew") your random forest on some training data and tested the performance on some hold-out data, so the performance you speak of is valid.