Quick guide to training highly imbalanced data sets



I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so the data set is quite imbalanced. A plain random forest just labels all test samples as the majority class.

Some good answers about sub-sampling and weighted random forest are given here: What are the implications for training a Tree Ensemble with highly biased datasets?

Which classification methods besides RF can handle the problem in the best way?
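To illustrate the setup, here is a minimal sketch of one of the standard fixes for a plain random forest on data like this: class weighting in scikit-learn. The synthetic data and parameter values are illustrative assumptions, not part of the question.

```python
# Sketch: a class-weighted random forest on data with a ~10:1 imbalance,
# mirroring the 1000 positive / 10000 negative split in the question.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=11000, weights=[0.9, 0.1],
                           random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the rare positive class cost roughly 10x more.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
```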


Posted 2014-09-12T15:20:51.767

Reputation: 4 894

See also https://stats.stackexchange.com/q/247871/232706

– Ben Reiniger – 2019-08-25T02:07:42.050



  • Max Kuhn covers this well in Chapter 16 of Applied Predictive Modeling.
  • As mentioned in the linked thread, imbalanced data is essentially a cost-sensitive training problem, so any cost-sensitive approach is applicable to imbalanced data.
  • There are a large number of such approaches, though not all are implemented in R: C5.0 and weighted SVMs are options, as is JOUS-Boost. RUSBoost is, I think, only available as Matlab code.
  • I don't use Weka, but I believe it has a large number of cost-sensitive classifiers.
  • "Handling imbalanced datasets: A review" by Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas
  • "On the Class Imbalance Problem" by Xinjian Guo, Yilong Yin, Cailing Dong, Gongping Yang, Guangtong Zhou
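As one concrete instance of the cost-sensitive approaches listed above, here is a weighted SVM sketched in scikit-learn (standing in for the R packages mentioned; the 9:1 cost ratio is an illustrative assumption).

```python
# Sketch: a cost-sensitive (weighted) SVM. A mistake on the minority
# class (label 1) is made nine times as costly as one on the majority class.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)

svm = SVC(kernel="rbf", class_weight={0: 1, 1: 9})
svm.fit(X, y)
```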


Posted 2014-09-12T15:20:51.767

Reputation: 356


Undersampling the majority class is usually the way to go in such situations.

If you think that you have too few instances of the positive class, you may instead perform oversampling, for example by sampling 5n instances with replacement from a dataset of size n.


  • Some methods are sensitive to changes in the class distribution; for Naive Bayes, for example, resampling shifts the prior probabilities.
  • Oversampling may lead to overfitting.
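The two resampling strategies above can be sketched with `sklearn.utils.resample`; the array shapes mirror the question's 1000/10000 split, and the variable names are illustrative.

```python
# Sketch: undersampling the majority class and oversampling the minority.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_neg = rng.randn(10000, 5)   # majority (negative) class
X_pos = rng.randn(1000, 5)    # minority (positive) class

# Undersampling: draw 1000 majority samples without replacement.
X_neg_down = resample(X_neg, replace=False, n_samples=1000, random_state=0)

# Oversampling: draw 5n minority samples with replacement from n instances.
X_pos_up = resample(X_pos, replace=True, n_samples=5000, random_state=0)

print(X_neg_down.shape, X_pos_up.shape)  # (1000, 5) (5000, 5)
```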

Alexey Grigorev

Posted 2014-09-12T15:20:51.767

Reputation: 2 460

Or maybe try some clustering algorithm and use the cluster centres? – Leela Prabhu – 2017-02-01T08:18:51.520

You could check this link for oversampling and other methods to deal with imbalanced datasets.

– janpreet singh – 2017-06-09T13:22:42.443


Gradient boosting is also a good choice here. You can use the gradient boosting classifier in scikit-learn, for example. Gradient boosting is a principled way of dealing with class imbalance because it constructs successive training stages that focus on the incorrectly classified examples.
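A minimal sketch of this suggestion with scikit-learn's `GradientBoostingClassifier` (the synthetic data and hyperparameters are illustrative assumptions):

```python
# Sketch: gradient boosting on imbalanced data. Each stage fits the
# errors left by the previous stages, so hard (often minority-class)
# examples receive progressively more attention.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=11000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)
accuracy = gb.score(X_te, y_te)
```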


Posted 2014-09-12T15:20:51.767

Reputation: 951


My understanding is that gradient boosting suffers from the same limitations as RF when dealing with imbalanced data: http://sci2s.ugr.es/keel/pdf/algorithm/articulo/2010-IEEE%20TSMCpartA-RUSBoost%20A%20Hybrid%20Approach%20to%20Alleviating%20Class%20Imbalance.pdf

– charles – 2014-09-15T13:02:40.797

Boosting is an additional step you take in building the forest that directly addresses imbalance. The paper you link notes this in the intro, stating that boosting helps even in cases where there is no imbalance, and it concludes that boosting helps significantly. So I'm not sure where the equivalence between RF and boosting is shown there? – cwharland – 2014-09-15T13:09:40.260


In addition to the answers posted here, if the number of positive examples is far too small compared to the negative examples, then the task comes close to being an anomaly detection problem in which the positive examples are the anomalies.

You have a whole range of methods for detecting anomalies, for example modelling all the points with a multivariate Gaussian distribution and then picking those that lie 2 or 3 standard deviations away from the mean.
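The multivariate-Gaussian idea above can be sketched as follows: fit a mean and covariance to the data, then flag points whose Mahalanobis distance exceeds 3 (the planted outlier and the threshold are illustrative assumptions).

```python
# Sketch: Gaussian anomaly detection via Mahalanobis distance.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)                 # mostly "normal" points
X = np.vstack([X, [[8.0, 8.0]]])       # one planted anomaly at index 1000

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

diff = X - mu
# Squared Mahalanobis distance for every point.
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
anomalies = np.where(np.sqrt(d2) > 3)[0]
```

Note that with a hard 3-sigma cutoff a handful of ordinary points will also be flagged (about 1% in 2D), so in practice the threshold is tuned on held-out data.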

One more thought: I have seen quite a few people randomly subsample the negative examples so that both classes end up the same size. Whether you want the classes balanced depends entirely on the problem at hand.


Posted 2014-09-12T15:20:51.767

Reputation: 359