## How to do feature analysis: pandas groupby().mean()

1

I'm analyzing the Titanic data from Kaggle with the help of a guide book. In the book, the relationship between the Pclass and Survived columns is analyzed as shown below.

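    import pandas as pd
    import matplotlib.pyplot as plt

    train_set = pd.read_csv('train.csv')
    fig = plt.figure(figsize=(12, 4))
    ax1 = fig.add_subplot(121)
    # Group Survived by Pclass and take the mean of each group
    PclassPlot = train_set['Survived'].groupby(train_set['Pclass']).mean()
    ax1.bar(x=PclassPlot.index, height=PclassPlot.values)
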
Why do you need to use the mean here? I thought I didn't need to calculate the mean to see the relationship between Pclass and Survived.

Please give me tips for analyzing data. Thank you for your help.

What did you do? You should post feedback also on your questions here. – rnso – 2018-12-31T01:48:43.940

@rnso Thank you a lot for your help! I now understand what groupby().mean() does when checking the relationship between two variables. – Yuki.U – 2018-12-31T02:55:51.137

1

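Means are calculated just to get a rough idea of the relationship. Because Survived is coded as 0 or 1, the mean of Survived within each Pclass is simply the survival rate of that class. For a more definite analysis, one can use simple statistical tests to assess the relationship between two variables. The test to apply depends on the types of the variables: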

- If both are numeric: correlation (Pearson or Spearman).
- If both are grouping (nominal) variables: chi-square test. Fisher's exact test can also be used if there are only 2 groups in each variable.
- If one is grouping and the other numeric:
  - if only 2 groups: Student's t-test or Mann-Whitney U test
  - if more than 2 groups: ANOVA or Kruskal–Wallis test
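
For the Pclass/Survived case both variables are grouping variables, so the chi-square test applies. Here is a minimal sketch, assuming `scipy` is installed and using a small made-up data frame `toy` in place of the real Kaggle file (the values are invented for illustration, and the sample is too small for the test to be reliable in practice):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data standing in for the Titanic columns (not the real Kaggle values).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate per class.
survival_rate = toy.groupby("Pclass")["Survived"].mean()

# Table of counts (rows: Pclass, columns: Survived), then the test of independence.
table = pd.crosstab(toy["Pclass"], toy["Survived"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(survival_rate)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

The group means show the direction of the relationship, while the p-value indicates how surprising such differences would be if Pclass and Survived were independent.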


The means of different groups give only partial information about the relationship. However, they may be sufficient to judge whether the feature is likely to be important for prediction. As a method of feature selection, if the group means do not differ significantly, the feature may be dropped from the analysis.
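
As a rough sketch of that feature-selection idea for a numeric feature and a two-group target, the following compares a feature between survivors and non-survivors. The data frame `toy`, its `Fare` values, and the `ALPHA` threshold of 0.05 are assumptions made up for illustration, not values from the Kaggle data or the book:

```python
import pandas as pd
from scipy.stats import ttest_ind, mannwhitneyu

# Made-up toy data: a numeric feature (Fare) and a two-group target (Survived).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Fare":     [7.5, 8.0, 9.0, 10.0, 13.0, 26.0, 20.0, 30.0, 52.0, 71.0, 80.0, 95.0],
})

# Mean of the feature within each group of the target.
group_means = toy.groupby("Survived")["Fare"].mean()

died = toy.loc[toy["Survived"] == 0, "Fare"]
survived = toy.loc[toy["Survived"] == 1, "Fare"]

# Student's t-test (equal variances assumed) and its rank-based alternative.
t_stat, p_t = ttest_ind(survived, died)
u_stat, p_u = mannwhitneyu(survived, died, alternative="two-sided")

ALPHA = 0.05  # conventional significance threshold, an assumption of this sketch
keep_feature = bool(p_t < ALPHA)

print(group_means)
print(f"t-test p={p_t:.4f}, Mann-Whitney p={p_u:.4f}, keep feature: {keep_feature}")
```

A feature whose group means do not differ significantly would be a candidate for dropping, although a significance test on its own is only one of several possible selection criteria.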