## Is numpy.corrcoef() enough to find correlation?

1

I am currently working through Kaggle's Titanic competition and I'm trying to figure out the correlation between the Survived column and the other columns. I am using numpy.corrcoef() to compute the correlation matrix between pairs of columns, and here is what I have:

The correlation between pClass & Survived is: [[ 1.         -0.33848104]
[-0.33848104  1.        ]]

The correlation between Sex & Survived is: [[ 1.         -0.54335138]
[-0.54335138  1.        ]]

The correlation between Age & Survived is:[[ 1.         -0.07065723]
[-0.07065723  1.        ]]

The correlation between Fare & Survived is: [[1.         0.25730652]
[0.25730652 1.        ]]

The correlation between Parent-Children & Survived is: [[1.         0.08162941]
[0.08162941 1.        ]]

The correlation between Sibling-Spouse & Survived is: [[ 1.        -0.0353225]
[-0.0353225  1.       ]]

The correlation between Embarked & Survived is: [[ 1.         -0.16767531]
[-0.16767531  1.        ]]


There should be a higher correlation between Survived and [pClass, sex, Sibling-Spouse], and yet the values are really low. I'm new to this, so I understand that a simple method is not the best way to find correlations, but at the moment this doesn't add up.

This is my full code (without the print() calls):

import pandas as pd
import numpy as np

train = pd.read_csv('train.csv')  # Kaggle Titanic training data

survived = train['Survived']
pClass = train['Pclass']
sex = train['Sex'].replace(['female', 'male'], [0, 1])
age = train['Age'].fillna(round(train['Age'].mean()))  # mean() already skips NaN
fare = train['Fare']
parch = train['Parch']
sibSp = train['SibSp']
embarked = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
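The omitted print() calls presumably look something like the sketch below, which reproduces one of the 2x2 matrices on a small synthetic sample (the column names match the competition data, but the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the relevant columns of the Kaggle train.csv.
train = pd.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 0],
    'Sex':      ['female', 'male', 'female', 'female', 'male', 'male'],
})

sex = train['Sex'].replace(['female', 'male'], [0, 1])

# np.corrcoef returns the full 2x2 matrix; the off-diagonal
# entry is the correlation coefficient between the two columns.
corr = np.corrcoef(sex, train['Survived'])
print(corr[0, 1])
```

In this toy sample every woman survives and every man does not, so the coefficient comes out as exactly -1.0 (negative because female is encoded as 0 and Survived as 1).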


Why do you think the values should be higher? – nairboon – 2019-05-14T09:56:17.537


Because there is a strong correlation between sex, class and survival. Women and rich passengers were most likely to survive. – Andros Adrianopolos – 2019-05-14T09:59:13.500

4

On a side note, I don't think correlation is the correct measure of relation for you to be using, since Survived is technically a binary categorical variable.

"Correlation" measures used should depend on the type of variables being investigated:

1. continuous variable v continuous variable: use "traditional" correlation - e.g. Spearman's rank correlation or Pearson's linear correlation.
2. continuous variable v categorical variable: use an ANOVA F-test / difference of means
3. categorical variable v categorical variable: use Chi-square / Cramer's V
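As a sketch of items 2 and 3 (assuming scipy is available; the data below is synthetic, not the Titanic data): a binary target against a continuous variable can use the point-biserial correlation, which is Pearson's correlation specialized to a 0/1 variable, and two categoricals can use a chi-square test of independence on the contingency table.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-ins: Survived is binary, Fare continuous, Sex categorical.
survived = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fare     = np.array([80.0, 8.0, 60.0, 55.0, 10.0, 7.0, 70.0, 9.0])
sex      = np.array(['f', 'm', 'f', 'f', 'm', 'm', 'f', 'm'])

# Item 2: binary vs. continuous -- point-biserial correlation.
r, p = stats.pointbiserialr(survived, fare)

# Item 3: categorical vs. categorical -- chi-square test of
# independence on the 2x2 contingency table.
table = pd.crosstab(sex, survived)
chi2, p2, dof, expected = stats.chi2_contingency(table)
```

In this made-up sample survivors paid much higher fares, so r is strongly positive; the chi-square statistic similarly picks up the sex/survival association without requiring any numeric encoding of the categories.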

1

Here is a closely related old post.

– Esmailian – 2019-05-18T15:29:47.360

@bradS When you say ANOVA F-test/difference of means, do you mean dividing ANOVA F-test by difference of means? – Andros Adrianopolos – 2019-05-19T17:50:16.807

@AtillaAdrianopolos, no I mean "/" as "or". Using item 3 above as an example, use Chi-square test of independence or Cramer's V. – bradS – 2019-05-20T08:09:17.917

1

You probably encoded women as 0 and men as 1; that's why you get a negative correlation of -0.54, because Survived is 0 for No and 1 for Yes. Your calculation actually shows what you expected. The negative sign only indicates the direction implied by your encoding; the strength of the relationship between being a woman and Survived is 0.54.

Similarly, pClass is correlated negatively at -0.33 because the highest class (1st class) is encoded as 1 and the lowest as 3, so the direction comes out negative.

You could make the relations more intuitive by creating new indicator columns for men and women, with 0 or 1 depending on the sex; then the correlations will have the intuitive direction (sign). The same holds for pClass.
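A minimal sketch of that re-encoding (on synthetic data; pd.get_dummies does the column-splitting): each category gets its own 0/1 indicator column, and the correlation with each indicator has the same magnitude but the sign now matches intuition.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the relevant columns of train.csv.
train = pd.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 0],
    'Sex':      ['female', 'male', 'female', 'female', 'male', 'male'],
})

# One indicator column per category: Sex_female and Sex_male.
dummies = pd.get_dummies(train['Sex'], prefix='Sex')

r_female = np.corrcoef(dummies['Sex_female'].astype(float), train['Survived'])[0, 1]
r_male   = np.corrcoef(dummies['Sex_male'].astype(float), train['Survived'])[0, 1]
# Same magnitude, opposite sign -- only the direction changed.
```

In this toy sample the female indicator coincides exactly with Survived, so r_female is 1.0 and r_male is -1.0; on the real data you would see +0.54 and -0.54.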

What if I encode male/female with 3/4 instead? They're still binary values and that just might solve the problem you're raising. – Andros Adrianopolos – 2019-05-14T10:15:33.553