Can I expect good results with low-correlation attributes?


This was a question I saw in an interview for a data scientist position:

"Here is the correlation heatmap that I got from my attributes. Looking at the correlation of each feature with the dependent variable (target/class), it is noticeable that none of the correlations is very strong.

[Image: correlation heatmap of the attributes]

Yet I would like to know whether I can expect good results from a classification model using this dataset. Also, what further investigation can I do (if I shouldn't look at correlation only)?"


Posted 2020-09-15T21:23:09.700

Reputation: 261

Did you try to train a classifier? – Sahar Milis – 2020-09-16T03:45:33.027

This was a question from an interview for a data scientist position at a company in my town. – joann2555 – 2020-09-16T20:59:52.713



It's a general question, so there are more than a few things you can do.
That said, what is stopping you from training a basic classifier and investigating the results?

Some ideas:

  • Use the Predictive Power Score to keep investigating your data
  • Check for non-linear correlation between the features
  • Investigate the feature importances
  • Use dimensionality reduction
  • Check for class imbalance
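
As a rough illustration of why a correlation heatmap alone can mislead, here is a minimal sketch on synthetic data. The data, the helper `make_data`, and the choice of mutual information are all assumptions made for illustration; mutual information is used here as a stand-in for a non-linear dependence measure and is not the Predictive Power Score itself.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def make_data(n=2000, seed=0):
    """Hypothetical toy data: the class depends on |x0|; x1 is pure noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (np.abs(X[:, 0]) > 0.5).astype(int)
    return X, y


X, y = make_data()

# Pearson correlation of each feature with the target (linear dependence only).
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Mutual information can also pick up non-linear dependence.
mi = mutual_info_classif(X, y, random_state=0)

print("Pearson correlation with target:", pearson.round(3))
print("Mutual information with target: ", mi.round(3))
```

In this toy setup the first feature fully determines the class, yet its Pearson correlation with the target should be close to zero, while its mutual information should be clearly larger than that of the noise feature.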

Sahar Milis

Posted 2020-09-15T21:23:09.700

Reputation: 146

I should've explained that this was a question from an interview for a data scientist position in a company. I will edit the question. – joann2555 – 2020-09-16T21:00:53.240


Correlation does not affect a decision-tree model in a classification problem.

In the theory of decision tree models, you don't need to look at correlation or check for multicollinearity, because the splits in a decision tree are chosen by entropy/information gain, whereas correlation only captures linear dependencies. The same holds when the features are highly correlated: you can still get very good results with decision trees, because you don't need to delete correlated features or apply dimensionality reduction (unless you have to for other reasons).

You may not get very good results, however, if you use models built on linear combinations of the features, such as a multiclass neural network or multiclass logistic regression. There, dimensionality reduction and similar preprocessing can have a large influence on accuracy.
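
The contrast between a tree and a linear model can be sketched on the same kind of toy problem. Everything here (the synthetic data, the hyperparameters, the 5-fold setup) is an assumption chosen for illustration, not a benchmark: the class depends on `|x0|`, so no single linear boundary separates the classes, while two threshold splits do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: class 1 when |x0| > 0.5; x1 is pure noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (np.abs(X[:, 0]) > 0.5).astype(int)

# A shallow tree can split on x0 at two thresholds.
tree_acc = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5
).mean()

# Logistic regression can only draw one linear boundary in feature space.
logreg_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"decision tree accuracy:       {tree_acc:.3f}")
print(f"logistic regression accuracy: {logreg_acc:.3f}")
```

On data like this the tree should score far higher than logistic regression, even though neither feature shows a meaningful linear correlation with the target. Adding a transformed feature such as `abs(x0)` would be one way to help the linear model.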

I had a similar question, but with highly correlated features: "Decision-tree regression to avoid multicollinearity for regression model?"

In your case I would say that, if we use decision trees, the low correlation is not a problem. However, we should verify this with the permutation importance of the features and check for polynomial dependencies. Of course, you should also ask the interviewer more about the question and its goal, to get more background information. This is very important in interviews.
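
A permutation-importance check along these lines could look like the following sketch. It uses scikit-learn's `permutation_importance` on made-up data in which only the first of three features carries signal; the data, the model settings, and the split are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: only feature 0 matters (class 1 when |x0| > 0.5).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 3))
y = (np.abs(X[:, 0]) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the accuracy drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = result.importances_mean

for j in np.argsort(importances)[::-1]:
    print(f"feature {j}: {importances[j]:.3f} +/- {result.importances_std[j]:.3f}")
```

A feature whose shuffling causes a large accuracy drop is one the model relies on, regardless of how weak its linear correlation with the target looked in the heatmap.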


Posted 2020-09-15T21:23:09.700

Reputation: 319