How to do feature analysis: pandas groupby().mean()


I'm analyzing the Titanic data from Kaggle, following a guidebook. In the book, the relationship between the Pclass and Survived columns is analyzed as shown below.

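    import pandas as pd
    import matplotlib.pyplot as plt

    train_set = pd.read_csv('train.csv')
    fig = plt.figure(figsize=(12, 4))
    ax1 = fig.add_subplot(121)
    # Mean of Survived (0/1) within each Pclass group
    PclassPlot = train_set['Survived'].groupby(train_set['Pclass']).mean()
    # The start of this call was lost in the post; a bar chart is assumed from the height= argument
    ax1.bar(x=PclassPlot.index, height=PclassPlot.values)

Why do you need to use the mean? I thought I didn't need to calculate the mean to see the relationship between Pclass and Survived.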

Please give me tips for analyzing data. Thank you for your help.


Posted 2018-12-30T14:51:00.253

Reputation: 83

What did you do? You should also post feedback on your questions here. – rnso – 2018-12-31T01:48:43.940

@rnso Thank you very much for your help! I now understand what groupby().mean() tells me when checking the relationship between two variables. – Yuki.U – 2018-12-31T02:55:51.137



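The means are calculated just to get a rough idea of the relationship. Because Survived is coded 0/1, the mean within each Pclass group is the fraction of passengers in that class who survived. For a more definite analysis, one can use simple statistical tests to assess the relationship between two variables. Which test to apply depends on the types of the variables: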

If both are numeric: correlation (Pearson or Spearman)

If both are grouping (nominal) variables: chi-square test
   Fisher's exact test can also be used if each variable has only 2 groups.

If one is a grouping variable and the other is numeric:
   if there are only 2 groups: Student's t-test or Mann-Whitney U test
   if there are more than 2 groups: ANOVA or Kruskal–Wallis test
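The tests listed above could be run with scipy.stats roughly as follows. This is only a sketch: the DataFrame is made-up toy data that borrows Titanic-style column names (it is not the Kaggle file), and the choice of scipy is an assumption, since the answer does not name a library.

```python
import pandas as pd
from scipy import stats

# Made-up toy data with Titanic-like column names (NOT the real Kaggle data).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Sex":      ["f", "f", "m", "m", "f", "m", "f", "m", "m", "m", "f", "f"],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1],
    "Age":      [38, 35, 54, 58, 27, 14, 30, 40, 22, 26, 2, 4],
    "Fare":     [71.3, 53.1, 51.9, 26.6, 11.1, 30.1, 13.0, 16.0, 7.3, 7.9, 21.1, 16.7],
})

# Both numeric: Pearson and Spearman correlation.
r, p_r = stats.pearsonr(df["Age"], df["Fare"])
rho, p_rho = stats.spearmanr(df["Age"], df["Fare"])

# Both nominal: chi-square test on the contingency table.
table = pd.crosstab(df["Pclass"], df["Survived"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Both nominal with 2 groups each: Fisher's exact test on the 2x2 table.
table_2x2 = pd.crosstab(df["Sex"], df["Survived"]).to_numpy()
odds_ratio, p_fisher = stats.fisher_exact(table_2x2)

# One grouping variable with 2 groups, one numeric: t-test or Mann-Whitney U.
fare_lived = df.loc[df["Survived"] == 1, "Fare"]
fare_died = df.loc[df["Survived"] == 0, "Fare"]
t, p_t = stats.ttest_ind(fare_lived, fare_died)
u, p_u = stats.mannwhitneyu(fare_lived, fare_died, alternative="two-sided")

# One grouping variable with more than 2 groups: ANOVA or Kruskal-Wallis.
fare_by_class = [g["Fare"].to_numpy() for _, g in df.groupby("Pclass")]
f, p_f = stats.f_oneway(*fare_by_class)
h, p_h = stats.kruskal(*fare_by_class)

print(f"chi-square p={p_chi2:.3f}, Fisher p={p_fisher:.3f}, ANOVA p={p_f:.3f}")
```

With only 12 rows the p-values here mean very little; the point is just which function matches which pair of variable types.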

The means of the different groups give only partial information about the relationship. However, they may be enough to tell whether the feature is likely to be important for prediction. As a method of feature selection, if the group means do not differ significantly, the feature may be dropped from the analysis.
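To see what the group means in the question actually represent, here is a small sketch on made-up data (the values are invented for illustration, only the column names follow the Titanic set). For a 0/1 column such as Survived, the mean per group is the same as the number of survivors divided by the group size.

```python
import pandas as pd

# Made-up toy data (NOT the real Kaggle file); Survived is coded 0/1.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Same pattern as in the question: mean of Survived within each Pclass.
rate_by_class = df["Survived"].groupby(df["Pclass"]).mean()

# The same numbers computed by hand: survivors / passengers per class.
counts = df.groupby("Pclass")["Survived"].agg(["sum", "count"])
manual_rate = counts["sum"] / counts["count"]

print(rate_by_class)
```

If the rates differ clearly between classes, as they do in this toy example, Pclass is probably worth keeping as a feature; whether the difference is significant is a question for a test such as chi-square.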


Posted 2018-12-30T14:51:00.253

Reputation: 1 316