Correlation and Naive Bayes



I would like to ask whether the Pearson correlation between the fields of a dataset (excluding the class field) somehow affects the performance of Naive Bayes when it is applied to that dataset to predict the class field.

Posted 2015-11-27T19:04:28.730

Reputation: 41



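As you probably know, "naive" here means that the "fields" (features) are assumed to be independent of one another. So your question boils down to: does correlation imply dependence? Yes, it does: independent variables have zero correlation, so any nonzero correlation means the variables are dependent. Strictly speaking, the assumption is independence given the class, so what matters most is correlation between features within each class.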

So, if your features are correlated, the naive assumption is violated. Despite this, Naive Bayes has been shown to be fairly robust to violations of this assumption. If your model still suffers from it, however, you could consider transforming the feature space with a method such as PCA, which produces linearly uncorrelated components (although uncorrelated does not guarantee independent).
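To make the PCA suggestion concrete, here is a minimal sketch in plain Python for the two-feature case. The data is synthetic and the helper names (`pearson`, `pca_rotate_2d`) are made up for this illustration; a real pipeline would more likely use a library PCA and fit it on the training data only. It only shows that rotating two correlated features onto their principal axes removes the linear correlation between them. It does not show that the result is independent within each class.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def pca_rotate_2d(xs, ys):
    """Center two features and rotate them onto their principal axes.

    For two features, PCA reduces to a rotation by the angle that
    zeroes the off-diagonal entry of the covariance matrix.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    sxy = sum(a * b for a, b in zip(cx, cy))
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2


# Synthetic, strongly correlated features (assumption: illustrative data only).
random.seed(0)
f1 = [random.gauss(0, 1) for _ in range(500)]
f2 = [x + 0.5 * random.gauss(0, 1) for x in f1]

corr_before = pearson(f1, f2)
pc1, pc2 = pca_rotate_2d(f1, f2)
corr_after = pearson(pc1, pc2)

print(f"correlation before PCA: {corr_before:.3f}")
print(f"correlation after PCA:  {corr_after:.3e}")
```

The rotated components `pc1` and `pc2` would then be fed to Naive Bayes in place of the original features.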


Posted 2015-11-27T19:04:28.730



Yes, it will affect the performance of Naive Bayes.

It is called naive because it assumes independence between the features, which is rarely the case in practice. However, it has been shown to be fairly robust to this and to perform well on real-world problems. So correlation between features goes against the naive assumption.

However, correlation is not necessarily good or bad for the performance of your model. Correlation between features in Naive Bayes simply means that if one feature "says" the class is A, then the other feature(s) will often say the same, so that evidence is effectively counted more than once. Therefore, if your correlated features happen to be good predictors, your model will actually benefit from it; if they happen to be bad predictors, your model will be worse off.
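A small sketch of that double-counting effect, using the extreme case of a perfectly correlated (duplicated) feature. Everything here is an assumption made for illustration: two classes A and B with equal priors, one Gaussian feature with mean 0 under A and mean 1 under B (standard deviation 1 in both), and a hypothetical helper `nb_posterior_a`. The same observation is passed once, then three times, as if it were three independent features.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def nb_posterior_a(xs, params_a, params_b, prior_a=0.5):
    """P(class A | xs) under Naive Bayes with Gaussian likelihoods.

    params_a / params_b hold one (mean, std) pair per feature.
    Per-feature log-likelihoods are simply added, which is exactly
    the naive independence assumption.
    """
    log_a = math.log(prior_a)
    log_b = math.log(1 - prior_a)
    for x, (ma, sa), (mb, sb) in zip(xs, params_a, params_b):
        log_a += math.log(gaussian_pdf(x, ma, sa))
        log_b += math.log(gaussian_pdf(x, mb, sb))
    return 1 / (1 + math.exp(log_b - log_a))


# One feature: mean 0 under class A, mean 1 under class B (illustrative values).
a, b = (0.0, 1.0), (1.0, 1.0)
x = 0.0  # an observation that mildly favours class A

p_single = nb_posterior_a([x], [a], [b])
# The same feature duplicated twice more: no new information,
# but Naive Bayes treats the copies as independent evidence.
p_triple = nb_posterior_a([x, x, x], [a, a, a], [b, b, b])

print(f"P(A | one copy)     = {p_single:.3f}")  # about 0.62
print(f"P(A | three copies) = {p_triple:.3f}")  # about 0.82
```

The posterior becomes more confident even though nothing new was observed. When the feature points toward the right class this helps, and when it points toward the wrong class the same amplification hurts.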

Valentin Calomme

Posted 2015-11-27T19:04:28.730

Reputation: 4 666