Kaggle Titanic submission score is higher than local accuracy score

3

This is the starter challenge, Titanic. The original question I posted on Kaggle is here. However, nobody there offered any insightful advice, so I am turning to the Stack Overflow community.

Based on this Notebook, we can download the ground truth for this challenge and get a perfect score.

I tested it, and it does give me 100% on the LB, which confirms it is the ground truth as claimed. (Side question: how do I remove this perfect submission? My profile now shows 100% on this challenge, but I want to show my real score, which is roughly 80% and which I will keep improving.)

Submissions on Kaggle sometimes take several minutes to return a score, so to save time I used the ground truth locally to test my different models. However, the local scores always differ from the scores Kaggle reports. See the following:

[Image: screenshot of the scores]

This is the code I use. What is wrong with it? You can try it on your own submission; do you see the same problem?

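import os
import pandas as pd

# Assumption: dirname was not defined in the original snippet;
# here it is taken to be the folder containing this script.
dirname = os.path.dirname(os.path.abspath(__file__))

def mark(pred):
    # Build the paths to the downloaded ground truth and to the submission file.
    solution = os.path.join(dirname, './output/solution.csv')
    submission = os.path.join(dirname, './output/'+pred)
    solution = pd.read_csv(solution)
    submission = pd.read_csv(submission)

    solution.columns = ['PassengerId', 'Sol']
    submission.columns = ['PassengerId', 'Pred']

    # Note: rows are paired by position here, not by PassengerId.
    df = pd.concat([solution[['Sol']], submission[['Pred']]], axis=1)
    num_row = df.shape[0]
    # Print the file name without '.csv' and the fraction of matching rows.
    print(pred[:-4], '==', (df[(df['Sol'] == df['Pred'])]).shape[0] / num_row)

if __name__ == "__main__":
    mark('achieve_99_dtree_rfe.csv')
    mark('advanced_feature_with_stacking_5_fold.csv')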

Kenny

Posted 2019-12-20T17:08:38.197

Reputation: 31

Answers

0

First question: on certain Kaggle competitions you can choose which submission counts from the submissions page. You may not be able to do that on the Titanic one, in which case you are stuck with 100 percent.

Regarding your second question, why your local results differ, there are a couple of possible explanations. First, randomization: did you set all the seeds? Second, are all the module and library versions you have locally the same as the ones on Kaggle? Third, different hardware could also be the reason.
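
On the seed point, here is a minimal sketch of fixing the random seeds before generating predictions, so that two runs produce the same output. It assumes the randomness comes from Python's `random` module or NumPy's global generator; `set_seeds` and `noisy_predictions` are hypothetical helpers written for illustration, not part of the asker's code.

```python
import random
import numpy as np

def set_seeds(seed=42):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

def noisy_predictions(seed, n=5):
    """Stand-in for a model with random components: returns n random 0/1 labels."""
    set_seeds(seed)
    return np.random.randint(0, 2, size=n).tolist()

# Two runs with the same seed give identical labels.
first_run = noisy_predictions(seed=0)
second_run = noisy_predictions(seed=0)
print(first_run == second_run)  # True
```

Libraries with their own randomness usually need their own setting as well; many scikit-learn estimators, for example, take a `random_state` argument.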

user87235

Posted 2019-12-20T17:08:38.197

Reputation:

Thanks for replying. As I understand it, I only submit my prediction.csv, and they count how many predictions I get right for the same PassengerId. For example, if PassengerId 1000 is 1 but my prediction is 0, it is not counted towards my final score. They don't see my model or algorithm, so how is that related to seeds, libraries, or hardware? – Kenny – 2019-12-21T04:14:49.747
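
The per-PassengerId counting described in this comment can be sketched as follows. This is an illustration under stated assumptions, not Kaggle's actual scoring code: `accuracy_by_id` is a hypothetical helper, the four-row frames are made-up toy data, and both frames are assumed to use the `PassengerId` and `Survived` columns of the Titanic submission format. The sketch also shows how comparing rows by position can give a different number when the two files are not in the same row order.

```python
import pandas as pd

def accuracy_by_id(solution, submission):
    """Fraction of solution rows whose prediction matches, aligned on PassengerId."""
    merged = solution.merge(
        submission, on="PassengerId", how="left", suffixes=("_sol", "_pred")
    )
    # A passenger missing from the submission yields NaN and counts as wrong.
    return (merged["Survived_sol"] == merged["Survived_pred"]).mean()

# Toy data (made up): the submission lists the same passengers in reverse order.
solution = pd.DataFrame({"PassengerId": [892, 893, 894, 895],
                         "Survived": [0, 1, 0, 1]})
submission = pd.DataFrame({"PassengerId": [895, 894, 893, 892],
                           "Survived": [1, 0, 0, 0]})

by_id = accuracy_by_id(solution, submission)
# Position-based comparison ignores PassengerId entirely.
by_position = (solution["Survived"].values == submission["Survived"].values).mean()

print(by_id)        # 0.75
print(by_position)  # 0.25
```

If the local files are already in the same row order, the two numbers agree and row alignment is not the cause of the discrepancy.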