How to learn a classifier from a dataset with high imbalance



What are the most useful techniques for learning a binary classifier from a dataset with a high degree of imbalance (i.e., a dataset with the "target" class being much rarer than the "background" class)? For example,

  • Should one first down-sample the majority/background class to reduce its frequency and then readjust the probabilities reported by the learning algorithm? How should one do the readjustment?
  • Should one use different approaches for different learning algorithms, i.e., are there different techniques for dealing with imbalance in SVM, random forests, logistic regression, etc.?
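On the first bullet: one common way to do the readjustment, assuming the majority class was kept with a known probability β, is the prior-correction formula p = β·p_s / (β·p_s − p_s + 1), where p_s is the probability the model reports after training on the down-sampled data. A minimal sketch with scikit-learn (the data set and β are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# synthetic imbalanced data: ~2% target class
X, y = make_classification(n_samples=20000, weights=[0.98, 0.02],
                           random_state=0)

beta = 0.1  # keep 10% of the majority/background class
keep = (y == 1) | (rng.rand(len(y)) < beta)
clf = LogisticRegression().fit(X[keep], y[keep])

p_s = clf.predict_proba(X)[:, 1]          # probabilities under the sampled prior
p = beta * p_s / (beta * p_s - p_s + 1)   # corrected back to the original prior
```

The correction follows from Bayes' rule: down-sampling inflates the apparent prior of the target class, and the formula undoes exactly that inflation.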


Posted 2015-01-25T23:09:22.863

Reputation: 321



A common strategy for dealing with imbalance is to penalize more heavily the misclassifications in which the higher-frequency class is wrongly predicted.

In a binary classification problem you can weight the penalty by 1/n, where n is the number of examples of the opposite class, so that errors on the rare class cost more.
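A minimal sketch of this class-weighting idea in scikit-learn (the toy data set is made up; `class_weight="balanced"` uses weights proportional to 1/n_c, namely n_samples / (n_classes · n_c)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy imbalanced data: 950 background points vs 50 target points
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(950, 2), rng.randn(50, 2) + 2.0])
y = np.array([0] * 950 + [1] * 50)

# class weights proportional to 1/n_c: errors on the rare class cost more
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# for comparison: without weighting, the boundary is pushed toward the rare class
clf_plain = LogisticRegression().fit(X, y)
```

The same `class_weight` argument works for `sklearn.svm.SVC`/`LinearSVC` and `RandomForestClassifier`, so the technique carries across the algorithms mentioned in the question.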

See the following from Prof. Jordi Vitriá

[image: slide showing the loss function]

This is the loss function for structured output SVM.

The problem you mention is common in object recognition and object classification in images, where many more background images are used than images containing the object. An extreme case is exemplar SVMs, where just a single image of the object is used as the positive class.


Posted 2015-01-25T23:09:22.863

Reputation: 1 440


If the data set is highly imbalanced, I would suggest using a structured SVM instead of a basic classification model.

From section 3.3 of the paper "Predicting Structured Objects with Support Vector Machines"[1]: "What does this mean for learning? Instead of optimizing some variant of error rate during training, which is what conventional SVMs and virtually all other learning algorithms do, it seems like a natural choice to have the learning algorithm directly optimize, for example, the F1-score (i.e., the harmonic mean of precision and recall). This is the point where our binary classification task needs to become a structured output problem, since the F1-score (as well as many other IR measures) is not a function of individual examples (like error rate), but a function of a set of examples. In particular, we arrive at the structured output problem of predicting an array of labels y = (y_1, ..., y_n), y_i ∈ {−1, +1}, for an array of feature vectors x = (x_1, ..., x_n), x_i ∈ ℝ^N. Each possible array of labels ȳ now has an associated F1-score F1(ȳ, y) w.r.t. the true labeling y, and optimizing the F1-score on the training set becomes a well-defined problem."

I would suggest reading the entire paper. For an implementation you can use the Python pystruct library[2].
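To make the quoted point concrete: the F1-score is a function of the whole set of predictions, not an average over individual examples. A tiny illustration with scikit-learn (the label arrays are made up; this shows only the metric, not the structured SVM itself):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 1])

# 1 true positive, 1 false positive, 1 false negative:
# precision = 1/2, recall = 1/2, so F1 = 0.5
score = f1_score(y_true, y_pred)
```

Because F1 depends jointly on all predictions, it cannot be decomposed into per-example losses, which is exactly why the paper casts its optimization as a structured output problem.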


Posted 2015-01-25T23:09:22.863

Reputation: 1


I would also suggest trying the idea of anomaly detection using a Gaussian distribution. In some cases it works really well, especially with very skewed classes (say, among a million examples only 10-20 are '1' (in class) and all the rest are '0'). You can look it up in this video by Prof. Andrew Ng.

Or in text:

Note that this is not framed as a classification problem and does not use a classification algorithm; it models the density of the normal class instead.
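A minimal sketch of this Gaussian anomaly-detection idea (the data, the test points, and the threshold `eps` are all made up; in practice `eps` would be tuned on a validation set containing a few known anomalies):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.RandomState(0)
X_normal = rng.randn(1000, 2)                 # the abundant "background" class
X_new = np.array([[0.1, -0.2],                # a typical point
                  [6.0, 6.0]])                # a far-out point

# fit a Gaussian density to the normal data only
mu = X_normal.mean(axis=0)
cov = np.cov(X_normal, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov)

eps = 1e-3                                    # density threshold (assumed)
is_anomaly = density.pdf(X_new) < eps         # flag low-density points
```

Because the model is fit only on the abundant class, the handful of rare '1' examples never need to appear in training at all, which is what makes this approach viable under extreme skew.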

Maksim Khaitovich

Posted 2015-01-25T23:09:22.863

Reputation: 383