Is a correlation matrix meaningful for a binary classification task?

4

2

When examining my dataset, which has a binary target variable (y), I wonder whether a correlation matrix is useful for determining the predictive power of each variable.

My predictors (X) contain some numeric and some factor variables.

Georg Heiler

Posted 2016-10-04T14:32:19.823

Reputation: 277

Answers

7

Well, correlation, namely the Pearson coefficient, is built for continuous data. Thus, when applied to binary/categorical data, it gives you a measure of the relationship that is not necessarily correct or precise.
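To make this concrete for the numeric predictors: when one variable is numeric and the other is coded 0/1, the Pearson formula reduces to what is usually called the point-biserial correlation, i.e. a scaled difference between the two group means. The sketch below checks that identity in plain Python. The data is made up for illustration, and the function names `pearson` and `point_biserial` are my own, not from any library. Note that this only reflects a linear shift in means between the classes and says nothing about the factor variables.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def point_biserial(xs, ys):
    """Point-biserial correlation between numeric xs and 0/1 labels ys."""
    n = len(xs)
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    mean_all = sum(xs) / n
    # Population (divide-by-n) standard deviation of the numeric variable.
    std_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    # Difference in group means, scaled by spread and class balance.
    return (mean1 - mean0) / std_all * math.sqrt(len(group1) * len(group0) / n ** 2)

# Made-up illustrative data: one numeric predictor and a binary target.
x = [1.2, 2.3, 2.9, 3.8, 4.1, 5.0, 5.7, 6.4]
y = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson(x, y)
r_pb = point_biserial(x, y)
print(r, r_pb)  # the two values agree up to floating-point error
```

So the number you get is well defined, but it should be read as "how far apart are the class means, relative to the spread", not as a general measure of predictive power.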

There are quite a few answers on stats exchange covering this topic - this or this for example.

HonzaB

Posted 2016-10-04T14:32:19.823

Reputation: 1 521

3

It depends. Suppose you have a number of features, say 20, for your binary classification task. Some of these 20 features may be highly correlated with each other. That makes part of your feature space redundant, so you can start working out which features to drop while still achieving a good result.

In other cases, the task is still a binary classification task but none of the features are correlated with each other. In that case you would want to keep all the features for training your model and making predictions.

If you are using Python, a useful way to find out which features are correlated, and by how much, is to plot a scatter matrix using pandas, which shows how each feature relates to every other feature. The same information can be seen even more clearly by plotting the correlation matrix as a heatmap with the seaborn library.
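A minimal sketch of that workflow, assuming a small synthetic DataFrame (the columns `a`, `b`, `c`, the 0.9 threshold, and the helper `plot_views` are all made up for illustration). It computes the feature-to-feature correlation matrix, flags columns that are nearly duplicates of an earlier column, and defines an optional plotting helper. The plotting imports sit inside the helper so the rest runs without matplotlib or seaborn installed.

```python
import numpy as np
import pandas as pd

# Synthetic features: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
})

# Pairwise Pearson correlation between features (the pandas default).
corr = df.corr()

# Keep only the upper triangle so each pair is inspected once,
# then flag columns strongly correlated with an earlier column.
threshold = 0.9  # arbitrary cut-off chosen for this example
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.abs().where(mask)
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
print(redundant)

def plot_views(frame, corr_matrix):
    """Optional visual check: scatter matrix plus annotated heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    pd.plotting.scatter_matrix(frame, figsize=(6, 6))
    plt.figure()
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_views(df, corr)` draws both plots. Keep in mind that this only looks at redundancy among the features; it does not measure how well any feature predicts the binary target.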

enterML

Posted 2016-10-04T14:32:19.823

Reputation: 2 651

Indeed I already tried both pandas and seaborn. Still, I am unsure if a normal correlation is the right metric for a binary outcome. – Georg Heiler – 2016-10-04T16:35:16.377