I am trying to solve Kaggle's Titanic competition. In the test set, there is only 1 row with a null Fare value. It would be easy to replace it with the median or mean of all Fare values.
But I am thinking of plotting the relation between fares and classes (1, 2, 3), and between fares and the Embarked field, to look for patterns and narrow down the range of fare values before computing a replacement for the null value.
The Embarked field has 3 main values: S, Q and C.
The Pclass field has 3 values: 1, 2 and 3.
Fares vary with the class, of course.
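A numeric summary can complement the plots when looking for this kind of pattern. The following is only a minimal sketch on a made-up toy frame (the values are invented for illustration and are not the Kaggle data); it assumes the column names Pclass, Embarked and Fare, and uses a pandas groupby to get the median fare per class and per class/port pair.

```python
import pandas as pd

# Toy stand-in for the Titanic test set (invented values, not the real data).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Embarked": ["C", "S", "S", "Q", "S", "Q", "S"],
    "Fare":     [80.0, 60.0, 20.0, 14.0, 8.0, 7.5, None],
})

# Median fare per class; missing fares are skipped by median().
fare_by_class = toy.groupby("Pclass")["Fare"].median()

# Median fare per (class, port of embarkation) combination.
fare_by_class_port = toy.groupby(["Pclass", "Embarked"])["Fare"].median()

print(fare_by_class)
print(fare_by_class_port)
```

On the real data, `toy` would be replaced by `test` (or `train`) to see whether the fare ranges really separate by class.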
What I have done so far is:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(20, 20))
# Each plot gets its own axes from the 2x2 grid; columns are passed by name
sns.scatterplot(x='Embarked', y='Fare', data=test, ax=axs[0, 0])
sns.scatterplot(x='Age', y='Fare', data=test, ax=axs[0, 1])
sns.scatterplot(x='Pclass', y='Fare', data=test, ax=axs[1, 0])
sns.scatterplot(x='Sex', y='Fare', data=test, ax=axs[1, 1])
And the result was:
There are no patterns at the
But in the Pclass plot, we can clearly see that there is a strictly defined range for class 3.
I checked the same thing on the train set:
So they have nearly the same distribution.
How can I fill the null value with the median of the Fare values of passengers whose Pclass is 3?
The usual expression would be:
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
But I really need to restrict the median to the rows where Pclass is 3.
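To make the goal concrete, here is a minimal sketch of one way such a restriction could be written, using a boolean mask with `.loc`. It runs on a made-up toy frame (invented values, not the Kaggle data) and assumes the column names Pclass and Fare; the variable names are only illustrative.

```python
import pandas as pd

# Toy stand-in for the test set (invented values, not the real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3, 3],
    "Fare":   [80.0, 60.0, 20.0, 8.0, 7.5, 9.0, None],
})

# Variant: fill every missing fare with the median of its own class.
filled_by_class = toy["Fare"].fillna(
    toy.groupby("Pclass")["Fare"].transform("median")
)

# Median of the known fares among third-class passengers only.
third_class_median = toy.loc[toy["Pclass"] == 3, "Fare"].median()

# Fill only the missing fares that belong to third-class rows.
mask = toy["Fare"].isna() & (toy["Pclass"] == 3)
toy.loc[mask, "Fare"] = third_class_median
```

The masked version touches only third-class rows, while the groupby/transform variant generalizes to missing fares in any class.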