Good accuracy on train dataset with cross validation, but low accuracy on test dataset



I'm using the adult dataset from UCI and trying to predict the native country of a person, based on workclass, education.num, race, gender and income. I'm using sklearn and KNeighborsClassifier.

When I predict on the held-out split from train_test_split, I get an accuracy of 0.917252 with K = 15. But when I use the separate test dataset (provided at the UCI link), the accuracy drops to less than 1%.

Where am I going wrong?

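import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv('datasets/adult_data/adult.csv')
test = pd.read_csv('datasets/adult_data/adulttest.csv')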

# Cleaning null values
train = train[train["workclass"] != " ?"]
train = train[train["occupation"] != " ?"]
train = train[train[""] != " ?"]
test = test[test["workclass"] != " ?"]
test = test[test["occupation"] != " ?"]
test = test[test[""] != " ?"]

category_col =['workclass', 'race', 'education','marital.status', 'occupation',
               'relationship', 'gender', '', 'income']
for col in category_col:
    b, c = np.unique(train[col], return_inverse=True) 
    train[col] = c

for col in category_col:
    b, c = np.unique(test[col], return_inverse=True) 
    test[col] = c

features = train.drop('',axis=1)
target = train[""]
features_test = test.drop('', axis=1)
features_target = test[""]

features_name = ['workclass', 'education.num', 'race', 'gender', 'income']

features = features[features_name]
features_test = features_test[features_name]

# Splitting the training data into train and held-out splits
X_train,X_test,y_train,y_test = train_test_split(features,target,random_state = 12)

k_values = np.arange(1,26)
scores = []

for i in k_values:
    clf = KNeighborsClassifier(n_neighbors=i).fit(X_train, y_train)

    y_predict = clf.predict(features_test)
    # y_predict = clf.predict(X_train)

    scores.append(metrics.accuracy_score(features_target, y_predict))
    # scores.append(metrics.accuracy_score(y_train, y_predict))

# np.argmax returns an index into scores, so map it back to the K value
print("Accuracy for K = {} is {}".format(k_values[np.argmax(scores)], max(scores)))

plt.title('Variation of accuracy with K value, with all features')
plt.xlabel('K values')

I think you are using the commented-out lines in the loop to calculate the cross-validation accuracy.

# y_predict = clf.predict(X_train)

It should be

 y_predict = clf.predict(X_test)

You should then compute the accuracy score against 'y_test', not 'y_train'.

You are predicting on the same data KNN was fitted on. Since KNN essentially memorizes its training data, you get good accuracy on it.
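
To illustrate the point above, here is a minimal sketch on synthetic data (an assumption: `make_classification` stands in for the adult dataset, which is not loaded here). It fits KNN on the training split and scores it both on that same split and on the held-out split. With K = 1 each training point is its own nearest neighbour, so the training score is perfect regardless of how well the model generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the adult dataset).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)

# Fit on the training split only.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Scoring on the data the model was fitted on is optimistic.
train_acc = accuracy_score(y_train, clf.predict(X_train))

# Scoring on the held-out split is the honest estimate.
test_acc = accuracy_score(y_test, clf.predict(X_test))

print("Accuracy on X_train: {:.3f}".format(train_acc))
print("Accuracy on X_test:  {:.3f}".format(test_acc))
```

Only the second number says anything about unseen data, which is why the loop should predict on X_test and compare against y_test.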


With both train and test the accuracy is really good; the problem is with the test dataset – Emanuel Huber – 2017-12-27T10:50:34.497

It could be a case of KNN overfitting to your data, as you are using a single train-test pair. Try using cross_validate() from sklearn to see whether a similar accuracy holds on average. You need to pass cv=n, where n is the number of folds you want it to create. – vc_dim – 2017-12-27T13:39:02.810
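
The cross_validate() suggestion in the comment above could look roughly like this. It is a sketch under assumptions: synthetic data from `make_classification` replaces the adult dataset, and K = 15 and cv=5 are illustrative choices, not values tuned for the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the adult dataset).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=12)

clf = KNeighborsClassifier(n_neighbors=15)

# cv=5 splits the data into 5 folds; each fold is used once for scoring
# while the model is fitted on the remaining 4.
cv_results = cross_validate(clf, X, y, cv=5, scoring="accuracy")
fold_scores = cv_results["test_score"]

print("Per-fold accuracy:", fold_scores)
print("Mean accuracy: {:.3f} (std {:.3f})".format(fold_scores.mean(),
                                                  fold_scores.std()))
```

If the per-fold scores vary widely, the single train_test_split score was probably a lucky split rather than a reliable estimate.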