Why is pandas corr() deleting columns?

4

I'm doing a basic correlation analysis, but for some reason pandas corr() is dropping columns and I'm not sure why.

import pandas as pd

data = pd.read_csv("data.csv")
print(len(data.columns))         # number of columns in the raw data
print(len(data.corr().columns))  # number of columns in the correlation matrix

Output:

100
64

raulb1

Posted 2019-10-30T15:55:44.493

Reputation: 87

Answers

7

Pearson's correlation is the default method used by pandas' corr().

Categorical (non-numerical) features are ignored in this calculation because they are not continuous. It makes no sense to say that if categorical_var1 increases by one, categorical_var2 also increases by X (where X depends on the correlation between the two variables).

That's why you only see numerical variables! There are other statistical tests you can apply to categorical variables to better understand them, such as the chi-squared test of independence.
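A minimal sketch of this behaviour, using a made-up three-column frame (the column names and values are hypothetical, not taken from the question's data). It assumes pandas 1.5 or later, where corr() accepts numeric_only, and passes numeric_only=True explicitly: older versions dropped non-numeric columns silently, while pandas 2.0 and later raise an error on them by default.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column
df = pd.DataFrame({
    "height": [1.6, 1.7, 1.8, 1.9],
    "weight": [55.0, 68.0, 77.0, 90.0],
    "city": ["Paris", "Rome", "Oslo", "Bern"],  # not numeric
})

# Ask explicitly for numeric columns only; "city" is left out of the result
corr = df.corr(numeric_only=True)

print(len(df.columns))     # columns in the raw data
print(list(corr.columns))  # columns that survived in the correlation matrix
```

Comparing the two column lists, for example with set(df.columns) - set(corr.columns), shows exactly which columns were left out.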

Note: some columns may look numerical at first glance but still contain a string because of an input mistake, or the column may have been given the object dtype when the file was formatted. Check the values in your supposedly numerical columns and use astype to convert them back to int or float.
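One possible way to run that check is sketched below, on a hypothetical frame where column "b" contains a stray "n/a" string. It uses pd.to_numeric with errors="coerce" rather than astype, which is a swapped-in technique: astype(float) would raise on the bad value, whereas to_numeric turns it into NaN so the offending entries can be listed first.

```python
import pandas as pd

# Hypothetical data: "b" should be numeric but contains a stray string
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["10", "20", "n/a", "40"],
})

print(df.dtypes)  # "b" shows up with a string/object dtype, not a numeric one

# Coerce to numbers; values that cannot be parsed become NaN
converted = pd.to_numeric(df["b"], errors="coerce")

# List the raw values that failed to parse
bad_values = df.loc[converted.isna() & df["b"].notna(), "b"]
print(bad_values.tolist())

# Replace the column with its numeric version (float64, because of the NaN)
df["b"] = converted

corr = df.corr()
print(list(corr.columns))  # both columns are now included
```

Because the coerced column contains NaN, it ends up as float; casting it to a plain integer type with astype(int) would fail until the missing values are filled or dropped.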

Blenz

Posted 2019-10-30T15:55:44.493

Reputation: 1 704

Thanks for this clarification, Blenz. But all columns have numerical values; there's no categorical data in this dataset. – raulb1 – 2019-10-30T21:55:03.523

Check the types of the columns with df.dtypes. I'm sure either a string has slipped into your numerical data unnoticed, or some columns were formatted to output strings instead of integers. If so, set the columns back to np.int32 or np.int64 using astype. – Blenz – 2019-10-31T08:32:27.767

That's correct, it was the formatting and some NaN values. Many thanks! – raulb1 – 2019-10-31T09:33:42.497