Dropping attributes leads to better classifier accuracy? (Titanic Set)


I am currently tackling the Titanic Dataset on Kaggle. The goal is to find a classifier that can predict whether a passenger will survive or die.

The dataset has features that I believe are strongly linked to the likelihood of survival, e.g. the passenger class. Clearly, the survival probability is higher for passengers travelling in class 1 than in class 3. Additionally, there is a "Fare" attribute, which gives the exact price a passenger paid.

I have tried different classifiers (e.g. Stochastic Gradient Descent and Random Forest), and both achieve better accuracy when I drop the passenger class attribute.

Another thing I have noticed is that if I condense the attributes "Siblings" and "Parents/Children" into a single attribute "TravelledAlone" (which can be 1 or 0), the classifier becomes more accurate.
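For reference, the condensing step amounts to something like the sketch below. It assumes the Kaggle column names `SibSp` and `Parch` for the two attributes; the helper name `travelled_alone` and the sample rows are made up for illustration and are not taken from the real data.

```python
def travelled_alone(sibsp: int, parch: int) -> int:
    """Return 1 if the passenger had no relatives aboard, else 0."""
    return int(sibsp + parch == 0)


# Made-up sample rows (assumed column names: SibSp, Parch).
passengers = [
    {"SibSp": 1, "Parch": 0},
    {"SibSp": 0, "Parch": 0},
    {"SibSp": 3, "Parch": 2},
]

# Add the condensed binary attribute to every row.
for row in passengers:
    row["TravelledAlone"] = travelled_alone(row["SibSp"], row["Parch"])

flags = [row["TravelledAlone"] for row in passengers]
print(flags)
```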

In total, these steps increase the accuracy from about 70% to almost 90%.

Can this be explained? After all, my results have become better by throwing away information...

Leon Kuhn

Posted 2020-07-02T17:19:12.187

Reputation: 1

(SGD is not a classifier) – Sean Owen – 2020-07-02T17:55:59.627

Thanks for the comment, I meant SGD-trained classifiers (sklearn.linear_model.SGDClassifier) – Leon Kuhn – 2020-07-02T18:03:13.350



I can't say for sure why this happens for this particular data set, but presumably "Fare" and "Class" are highly correlated. The classifier may do better reasoning about one of them rather than splitting its reasoning across both, since they carry mostly the same information.
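One way to test the correlation guess is to measure it directly. The sketch below is only an illustration: it uses a hand-written Pearson coefficient on made-up class and fare values, not the real data, and the function name `pearson` is my own. Since class is ordinal, a rank correlation may well be the more appropriate measure on the actual columns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only: cheaper tickets in higher class numbers.
pclass = [1, 1, 2, 2, 3, 3, 3]
fare = [80.0, 52.0, 26.0, 13.0, 8.05, 7.25, 7.9]

r = pearson(pclass, fare)
print(f"Pearson r between class and fare: {r:.2f}")
```

A value close to -1 or +1 would suggest that the two attributes are largely redundant, which would be consistent with one of them being droppable at little cost.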

It's also possible that just knowing whether the person had any family members aboard carries most of the useful information, and that reasoning about how many is unhelpful because the counts are mostly noise that the classifier will overfit.

It's also possible that, given the tiny data set size, you can find lots of ideas that happen to fit the data better, but wouldn't necessarily generalize.
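A rough way to guard against that last possibility is to compare feature sets with cross-validation rather than a single train/test split, and to look at the spread of the scores as well as the mean. The sketch below assumes scikit-learn and NumPy are available and runs on synthetic data generated on the spot (it is not the Titanic data, and the numbers it prints say nothing about the real problem); the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, size=n)               # classes 1..3
fare = 100.0 / pclass + rng.normal(0, 5, size=n)  # fare tied closely to class
alone = rng.integers(0, 2, size=n)                # 1 = travelled alone
logit = 1.5 - 0.8 * pclass - 0.5 * alone
survived = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

feature_sets = {
    "with class": np.column_stack([pclass, fare, alone]),
    "without class": np.column_stack([fare, alone]),
}

results = {}
for name, X in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # 5-fold cross-validated accuracy for this feature set.
    scores = cross_val_score(clf, X, survived, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between two feature sets is small compared with the fold-to-fold spread, the apparent improvement may simply be noise from the small sample.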

Sean Owen

Posted 2020-07-02T17:19:12.187

Reputation: 5 987