Which is better on an imbalanced dataset: Random Forest or Extra Trees?



I have an imbalanced dataset with 3 classes: 60% class 1, 38% class 2, and 2% class 3.

I don't want to generate more examples of class 3, and I cannot get more examples of class 3.

The problem is that I need to choose between Random Forest and Extra Trees (this is homework) and explain why I chose one of them.

So I chose the Random Forest classifier, but I am not sure whether my assumptions are right.

I chose it for two reasons. First, the splits in Extra Trees are random, so the probability that a split picks out some examples of class 3 is low. Second (and this is the real question), I think that because Random Forest has higher variance than Extra Trees, it can be more useful here, since the higher variance may help when the dataset is imbalanced.

So are these two assumptions, especially the last one, correct? Did I choose correctly in picking Random Forest over Extra Trees?



Posted 2020-06-21T20:08:29.803

Reputation: 317



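Both Random Forest and Extra Trees randomly sample the features at each split point. The difference is in how the threshold is chosen: Random Forest is greedy and searches for the optimal split point on the candidate features at each node, whereas Extra Trees selects the split point randomly (in scikit-learn, it draws one random threshold per candidate feature and keeps the best of those draws).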
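To make the difference above concrete, here is a minimal single-feature sketch in plain Python. It is not how either library implements tree building; the helper names (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) and the toy data are made up for illustration, with the data chosen to mirror the 60/38/2 class split from the question.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(xs, ys, threshold):
    """Weighted Gini impurity of the two children of the split x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(ys)


def best_threshold(xs, ys):
    """Random-Forest-style choice: scan every midpoint and keep the best one."""
    values = sorted(set(xs))
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(midpoints, key=lambda t: split_impurity(xs, ys, t))


def random_threshold(xs, rng):
    """Extra-Trees-style choice: one uniform draw between the feature's min and max."""
    return rng.uniform(min(xs), max(xs))


# Hypothetical toy feature: classes 1 and 2 are spread over the same range,
# and the two class-3 examples sit in a narrow band at the top.
xs = [37 * k for k in range(60)] + [59 * k for k in range(38)] + [2300, 2350]
ys = [1] * 60 + [2] * 38 + [3] * 2

rng = random.Random(0)
best_score = split_impurity(xs, ys, best_threshold(xs, ys))
random_draws = [random_threshold(xs, rng) for _ in range(1000)]
random_scores = [split_impurity(xs, ys, t) for t in random_draws]

# A draw isolates class 3 only if it lands in the gap below the class-3 band.
gap_low = max(x for x, y in zip(xs, ys) if y != 3)
gap_high = min(x for x, y in zip(xs, ys) if y == 3)
isolating = sum(1 for t in random_draws if gap_low <= t < gap_high)

print("impurity after exhaustive search:", best_score)
print("mean impurity after a random draw:", sum(random_scores) / len(random_scores))
print("random draws that isolate class 3:", isolating, "of", len(random_draws))
```

The exhaustive search can never do worse than a single random draw on the same feature, and a random draw only isolates the minority class when it happens to fall in the narrow gap below it.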

I would choose Random Forest because it's more likely to create a split point that accounts for the imbalanced class, whereas Extra Trees might keep splitting over and over again on a subset of the data without separating out class 3 due to the random split point.
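If you want to check this on data rather than take it on faith, a rough sketch with scikit-learn could look like the following. The synthetic dataset from `make_classification` is only a stand-in for your real data (the sizes, feature counts, and seeds are arbitrary assumptions), so the numbers it prints say nothing definitive about your problem. The point is to compare per-class recall, since overall accuracy hides what happens to the 2% class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the 60% / 38% / 2% class split from the question.
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    weights=[0.60, 0.38, 0.02],
    random_state=0,
)

# Stratify so the rare class appears in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

per_class_recall = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # average=None returns one recall value per class instead of a single mean.
    per_class_recall[name] = recall_score(y_test, predictions, average=None)
    print(name, per_class_recall[name])
```

The last entry of each printed array is the recall on the minority class, which is the number to compare when justifying the choice.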

Derek O

Posted 2020-06-21T20:08:29.803

Reputation: 325

Thanks for the answer. What is the right way to think about the bias-variance trade-off here: is it better to use a high-variance, low-bias model in this case? – Tlaloc-ES – 2020-06-22T08:49:45.360