I am currently tackling the Titanic Dataset on Kaggle. The goal is to find a classifier that can predict whether a passenger will survive or die.
The dataset has features that I believe are strongly linked to the likelihood of survival, e.g. the passenger class. Clearly, the survival probability is higher for passengers travelling in class 1 than in class 3. Additionally, there is a "Fare" attribute, which gives the exact ticket price a passenger paid.
I have tried different classifiers (e.g. Stochastic Gradient Descent and Random Forest), and both achieve higher accuracy when I drop the passenger class attribute.
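For reference, this is roughly the kind of comparison I ran. It is only a sketch: the data below is synthetic stand-in data (random values, not the real Kaggle train.csv), and the feature subset is just an example, so the printed accuracies are meaningless on their own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Kaggle train.csv, just to make the sketch runnable.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),       # passenger class 1..3
    "Fare": rng.uniform(5.0, 100.0, n),    # ticket price
    "Age": rng.uniform(1.0, 80.0, n),
})
df["Survived"] = (rng.random(n) < 0.4).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy with and without the Pclass attribute.
acc_with = cross_val_score(clf, df[["Pclass", "Fare", "Age"]], df["Survived"], cv=5).mean()
acc_without = cross_val_score(clf, df[["Fare", "Age"]], df["Survived"], cv=5).mean()
print(f"with Pclass:    {acc_with:.3f}")
print(f"without Pclass: {acc_without:.3f}")
```

On the real data I would also use a fixed `random_state` and cross-validation like this, so the comparison is not just noise from a single train/test split.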
Another thing I have noticed is that if I condense the attributes "Siblings" and "Parents/Children" into a single attribute "TravelledAlone" (which can be 1 or 0), the classifier becomes more accurate.
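Concretely, the condensing step looks like this (a minimal sketch using the `SibSp`/`Parch` column names from the Kaggle data, on a few made-up rows):

```python
import pandas as pd

# A few made-up rows with the siblings/spouses and parents/children counts.
df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# TravelledAlone = 1 if the passenger has no relatives aboard at all.
df["TravelledAlone"] = ((df["SibSp"] + df["Parch"]) == 0).astype(int)
print(df)
```

After this I drop the two original columns, so the classifier only sees the single binary attribute.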
In total, these steps increase the accuracy from about 70% to almost 90%.
Can this be explained? After all, my results have become better by throwing away information...