I'm working on the Titanic dataset and I've split my variables into 3 groups:
    # nominal variables
    nom_vars = ['Survived', 'Title', 'Embarked', 'Sex', 'Alone']

    # ordinal variables
    ord_vars = ['Survived', 'Pclass', 'FamilySize']

    # continuous variables
    cont_vars = ['Survived', 'Fare', 'Age']
To measure association within each group, I used Cramér's V, Kendall's tau, and Pearson's r, respectively. From these scores, I want to choose which features to keep or discard.
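For reference, here is a minimal sketch of how I compute the three scores. The `cramers_v` helper is my own name for the textbook formula (no bias correction), and the small DataFrame is made up for illustration only; it is not the real Titanic data. Note that pandas relies on SciPy being installed for `method="kendall"`.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of a contingency table.

    Hypothetical helper: plain formula, no bias correction.
    """
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # min(rows, cols) - 1
    return np.sqrt(chi2 / (n * k))


# Toy rows, invented for illustration (NOT the real Titanic data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "male", "female"],
    "Pclass": [3, 1, 2, 3, 1, 3, 2, 1],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 8.46, 21.0, 30.0],
})

v = cramers_v(df["Survived"], df["Sex"])                    # nominal vs nominal
tau = df["Survived"].corr(df["Pclass"], method="kendall")   # vs ordinal
r = df["Survived"].corr(df["Fare"], method="pearson")       # vs continuous

print(f"Cramér's V: {v:.3f}, Kendall's tau: {tau:.3f}, Pearson's r: {r:.3f}")
```

Cramér's V is unsigned (0 to 1), while tau and r are signed (-1 to 1), so the scores from the three groups are not directly comparable in magnitude.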
Now I'm having second thoughts... Each set contains the "Survived" variable, which I interpret as nominal and dichotomous. For the last set, I know that Pearson's r applied to one dichotomous and one continuous variable is simply the point-biserial correlation, so that shouldn't introduce any problems.
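As a sanity check on that claim, this sketch computes the point-biserial coefficient from its usual formula, r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2) with s_n the population standard deviation, and compares it to Pearson's r on the 0/1 coding. The arrays are made-up values for illustration, not the real Titanic data.

```python
import numpy as np

# Made-up values for illustration (NOT the real Titanic data).
survived = np.array([0, 1, 1, 0, 1, 0, 0, 1])
fare = np.array([7.25, 71.28, 13.0, 8.05, 53.1, 8.46, 21.0, 30.0])

# Group means and sizes for the two levels of the dichotomous variable
m1 = fare[survived == 1].mean()
m0 = fare[survived == 0].mean()
n1 = (survived == 1).sum()
n0 = (survived == 0).sum()
n = survived.size

# Population standard deviation (ddof=0) of the continuous variable
s_n = fare.std(ddof=0)

# Point-biserial coefficient from its definition
r_pb = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)

# Pearson's r computed directly on the 0/1 coding
r_pearson = np.corrcoef(survived, fare)[0, 1]

print(r_pb, r_pearson, np.isclose(r_pb, r_pearson))
```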
However, I'm worried about the second set, where putting "Survived" alongside the ordinal variables means running Kendall's tau on a nominal/ordinal mix.
Was this choice inappropriate? If so, what is an easily implemented alternative association metric? I found Cramér's V easy to code by looking at the formula on Wikipedia, and Kendall's tau is included in the pandas library, so that was great...