Random Forests with complementary features



In my dataset, I have two features that are not only correlated but that make sense only in the presence of each other. For instance, one would be the number of times a task was attempted and the other one would be the number of successes.

As mentioned, it seems to me that taking either of the two individually does not give any information. Should I use a scheme where, if I pick one of them in a tree of my RF, I automatically include the other one?

And if so, is it possible to do so using the RF class from scikit-learn?



Posted 2018-06-21T21:20:55.933

Reputation: 131

1I guess the RF will automatically find it out. – Aditya – 2018-06-22T00:42:50.470

Can you check if the two features are correlated? You can only do so directly if both features are numerical; if either of them is categorical, a different type of analysis needs to be performed. If both of them are numerical (continuous), then you can test for correlation, and if they are correlated you can drop either one. – Kaustubh – 2018-06-22T05:13:13.120



Notice that random forests (and decision trees in general) do not assume that the given features are independent. On the contrary, a typical classification path from the root to a leaf in one particular tree of the random forest would be, e.g., to apply a different rule on the successes feature depending on the value of the attempts feature. So, as one comment suggests, the algorithm will be able to identify certain dependencies.
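To see this concretely, here is a minimal sketch (with synthetic data, so the numbers are purely illustrative) showing that scikit-learn's RandomForestClassifier will use both features in its splits when the label depends on them jointly:

```python
# Illustrative sketch: a random forest can combine the two features along one
# root-to-leaf path, splitting first on attempts and then on successes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
attempts = rng.integers(1, 300, size=1000)        # synthetic attempt counts
successes = rng.binomial(attempts, 0.5)           # synthetic success counts

# The label depends on the two features jointly (success ratio above 0.5),
# so neither feature alone is informative.
y = (successes / attempts > 0.5).astype(int)

X = np.column_stack([attempts, successes])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Both features receive non-zero importance, i.e. both are used in splits.
print(clf.feature_importances_)
```

The forest is not told that the features belong together; it discovers the interaction simply because splits on one feature become useful conditional on splits on the other.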

However, you need to keep in mind that decision trees (and, in a certain sense, random forests as a consequence) split on one feature at a time, so they only define axis-aligned separations between classes. Thus, to enrich the feature space, you might want to "hint" to the algorithm some additional meta-features that encode semantics among the original features. For example, have you considered also introducing the success ratio (successes divided by attempts) as an additional feature?
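Such a meta-feature can be added with a few lines of preprocessing. A minimal sketch (the helper name and the zero-attempts guard are my own assumptions, not part of any library API):

```python
# Illustrative sketch: append a success-ratio meta-feature before training.
import numpy as np

def add_success_ratio(X):
    """X has columns [attempts, successes]; append successes/attempts."""
    attempts, successes = X[:, 0], X[:, 1]
    # Guard against division by zero when a row has zero attempts.
    ratio = np.divide(successes, attempts,
                      out=np.zeros_like(successes, dtype=float),
                      where=attempts > 0)
    return np.column_stack([X, ratio])

X = np.array([[10, 2], [100, 20], [0, 0]])
print(add_success_ratio(X))
```

The augmented matrix can then be fed to the random forest as usual; the original two columns are kept so the trees can still condition the ratio on the raw counts.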

Notice that additional features are not guaranteed to help, though the easiest way to find out is to try them. The reason is that the algorithm might already be able to learn the additional semantics you give it. Having said that, to me it is not obvious that a random forest would be able to "learn" a feature like the success ratio on its own.


Posted 2018-06-21T21:20:55.933

Reputation: 664

Thanks mapto for the answer! Wouldn't my individual trees make wrong decisions in the presence of the number of attempts alone? Obviously 300 attempts with 2 successes is much worse than 300 attempts with 150 successes, although this is pretty rare since they are correlated. As for the meta-feature, I was thinking of the ratio, which would help in some cases. But then there is the problem that 2/10 cannot be treated the same as 20/100, so I have the same problem that my ratio would be complementary with the number of attempts. – Tom – 2018-06-22T11:04:48.843

1I was thinking of some sort of binomial test score for the meta-feature too, e.g. under H_0: p = 0.5 (I don't think the value of p matters as long as it is the same for all the observations). My feature would then be the one-sided p-value for the alternative H_1: p < 0.5, so that both the ratio and the number of trials would be taken into account. – Tom – 2018-06-22T11:11:54.457
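One possible sketch of this binomial-test idea, using SciPy's `binomtest` (assuming SciPy ≥ 1.7; the helper name is hypothetical):

```python
# Illustrative sketch: a one-sided binomial test p-value
# (H_0: p = 0.5 vs H_1: p < 0.5) as a meta-feature, so that 2/10 and
# 20/100 get different values even though the ratios are equal.
from scipy.stats import binomtest

def binomial_feature(successes, attempts):
    return binomtest(successes, attempts, p=0.5, alternative="less").pvalue

print(binomial_feature(2, 10))    # same ratio, few attempts: weaker evidence
print(binomial_feature(20, 100))  # same ratio, more attempts: stronger evidence
```

With more attempts at the same ratio, the p-value shrinks, which is exactly the "sample size matters" behaviour the plain ratio lacks.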

Of course it is possible that the data is misinterpreted, but this is a general risk you need to manage, and random forests are good at that. You can always try to train the model without this feature, but I wouldn't expect it to achieve better results. If in your case you believe that such a risk is high, maybe some more advanced form of https://en.wikipedia.org/wiki/Dimensionality_reduction could be useful.

– mapto – 2018-06-22T11:15:02.147

Sure, you can use other formulas. E.g., to reward both the proportion of successes and the number of attempts, you can augment the ratio to X * attempts + successes/attempts, where X is a weight to balance the ranges of the two formula components; maybe X could be 1/upper_bound(attempts), to normalise the first term into [0,1]. However, notice that such a meta-feature is a simple function of the original features, so your decision trees might already be able to learn it on their own. It's worth playing around with different formulations to get both evidence and a gut feeling of what works and what doesn't. – mapto – 2018-06-22T11:34:11.203
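As a quick sketch of that augmented score (the function name and the choice of upper bound are illustrative assumptions):

```python
# Illustrative sketch: X * attempts + successes/attempts, with
# X = 1 / upper_bound(attempts) so the first term lies in [0, 1].
def augmented_score(successes, attempts, max_attempts):
    x = 1.0 / max_attempts  # weight balancing the two components
    return x * attempts + successes / attempts

print(augmented_score(2, 10, 300))     # few attempts, low ratio
print(augmented_score(150, 300, 300))  # many attempts, higher ratio
```

The score grows with both the number of attempts and the success ratio, so the two complementary signals are combined into a single value.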

Thanks for the good comments, mapto! I'll just try a few different features and check the results on my cross-validation. I'll go with the simple ratio and the binomial feature and see what happens. In any case I have lots more features, so it should not matter so much; it is mostly to have a clear mind :) – Tom – 2018-06-22T12:07:36.170


Agree with @mapto. A good approach would be to do some preprocessing and merge the two features into a new one. The success fraction could be a good choice, but you can think of others on your own as well.

Daniel Chepenko

Posted 2018-06-21T21:20:55.933

Reputation: 271