4

2

When examining my dataset with a binary target variable (y), I wonder whether a correlation matrix is useful for determining the predictive power of each variable.

My predictors (X) contain some numeric and some factor variables.

7

Well, correlation, namely the Pearson coefficient, is built for continuous data. When applied to binary or categorical data, it still gives you a measure of a relationship, but that measure is not necessarily correct or precise.

There are quite a few answers on Cross Validated (the statistics Stack Exchange site) covering this topic.
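To make concrete what Pearson measures when one variable is a 0/1 target, here is a minimal sketch using only the standard library. It assumes the target is coded as 0/1 and the predictor is numeric; the toy data and the function names `pearson` and `point_biserial` are made up for illustration. With that coding, the Pearson coefficient coincides with the point-biserial coefficient, which is a rescaled difference between the two class means. Factor predictors are not covered here and would need a different measure.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def point_biserial(x, y):
    """Point-biserial correlation: x is numeric, y is coded 0/1."""
    x1 = [a for a, b in zip(x, y) if b == 1]
    x0 = [a for a, b in zip(x, y) if b == 0]
    n, n1, n0 = len(x), len(x1), len(x0)
    # Difference of class means, scaled by the population std of x
    # and by the class balance.
    return (mean(x1) - mean(x0)) / pstdev(x) * math.sqrt(n1 * n0 / n**2)


# Hypothetical toy data: one numeric predictor and a binary target.
x = [1.2, 2.3, 2.9, 3.1, 4.8, 5.0, 5.5, 6.1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson(x, y)
r_pb = point_biserial(x, y)
print(r, r_pb)  # the two values agree up to floating-point error
```

So for a numeric predictor the number is interpretable as a class-mean separation, but it only captures a linear, mean-shift kind of relationship.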

3

It depends. Suppose you have a number of features, say 20, for your binary classification task. Some of these 20 features may be highly correlated with each other. That makes part of your feature space redundant, so you can work out which features to drop while still achieving a good result.

In other cases, the task is still binary classification but none of the features are correlated with each other. In that case you would want to include all the features when training your model and making predictions.

If you are using Python, a useful way to find out which features are correlated, and by how much, is to plot a scatter matrix with pandas, which shows how each feature relates to every other feature. The same information can be seen even more clearly by plotting the correlation matrix as a heatmap with the seaborn library.
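As a rough sketch of the redundancy check described above, the following builds a small synthetic DataFrame (the column names, the 0.9 threshold, and the data are all assumptions for illustration), computes the feature-to-feature correlation matrix with pandas, and flags one column from each highly correlated pair as a candidate to drop. The plotting calls are left as comments so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

# Synthetic features: "a_copy" is deliberately a noisy duplicate of "a".
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})

# Absolute pairwise Pearson correlations between features.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with an earlier column.
threshold = 0.9  # assumed cutoff; tune for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))

# Visual inspection (requires matplotlib / seaborn):
# pd.plotting.scatter_matrix(df)
# import seaborn as sns; sns.heatmap(df.corr(), annot=True)
```

Note that this only looks at correlation among the predictors; it says nothing by itself about how well each predictor relates to the binary target.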

Indeed I already tried both pandas and seaborn. Still, I am unsure if a normal correlation is the right metric for a binary outcome. – Georg Heiler – 2016-10-04T16:35:16.377