Good accuracy on train dataset with cross validation, but low accuracy on test dataset

I'm using the Adult dataset from UCI and trying to predict a person's native country based on workclass, education.num, race, gender, and income, using sklearn's KNeighborsClassifier.

With this, predicting on the test split I get 0.917252 accuracy with K = 15. But when I use the separate test dataset (given at the UCI link), the accuracy drops to less than 1%.

Where am I going wrong?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

train = pd.read_csv('datasets/adult_data/adult.csv')
test = pd.read_csv('datasets/adult_data/adulttest.csv')

# Cleaning null values
train = train[train["workclass"] != " ?"]
train = train[train["occupation"] != " ?"]
train = train[train["native.country"] != " ?"]
test = test[test["workclass"] != " ?"]
test = test[test["occupation"] != " ?"]
test = test[test["native.country"] != " ?"]

# Integer-encode each categorical column
category_col = ['workclass', 'race', 'education', 'marital.status', 'occupation',
                'relationship', 'gender', 'native.country', 'income']
for col in category_col:
    b, c = np.unique(train[col], return_inverse=True)
    train[col] = c

for col in category_col:
    b, c = np.unique(test[col], return_inverse=True)
    test[col] = c

features = train.drop('native.country',axis=1)
target = train["native.country"]
features_test = test.drop('native.country', axis=1)
features_target = test["native.country"]

features_name = ['workclass', 'education.num', 'race', 'gender', 'income']

features = features[features_name]
features_test = features_test[features_name]

# splitting the data into train and test splits
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=12)

k_values = np.arange(1,26)
scores = []

for i in k_values:
    clf = KNeighborsClassifier(n_neighbors=i)
    clf.fit(X_train,y_train)

    y_predict = clf.predict(features_test)
    # y_predict = clf.predict(X_train)

    scores.append(metrics.accuracy_score(features_target, y_predict))
    # scores.append(metrics.accuracy_score(y_train, y_predict))

print("Accuracy for {} is {}".format(np.argmax(scores),max(scores)))

plt.plot(k_values, scores)
plt.title('Variation of accuracy with K value, with all features')
plt.xlabel('K values')
plt.ylabel('Accuracy')

plt.show()

Emanuel Huber

Posted 2017-12-26T17:33:36.667

Reputation: 23

Answers

I think you are using the commented-out lines in the loop to calculate the cross-validation accuracy:

# y_predict = clf.predict(X_train)

It should be

y_predict = clf.predict(X_test)

and then you should check the accuracy score against y_test, not y_train.

Otherwise you are predicting on the same data the KNN has learnt on, and since KNN essentially memorizes the training data, you get misleadingly good accuracy.
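As a rough sketch, the corrected loop would look something like this (reusing the imports, variable names, and K range from the question; nothing else is assumed):

scores = []
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    # predict on the held-out split, not on the training rows
    y_predict = clf.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_predict))

best_k = k_values[np.argmax(scores)]
print("Best validation accuracy {:.4f} at K = {}".format(max(scores), best_k))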

vc_dim

Posted 2017-12-26T17:33:36.667

Reputation: 178

With both the train and test splits the accuracy is really good; the problem is with the separate test dataset – Emanuel Huber – 2017-12-27T10:50:34.497

It can be a case of KNN overfitting to your data, since you are using a single train-test pair. Try using cross_validate() from sklearn to see whether similar accuracy holds on average. You need to pass cv=n, where n is the number of folds you want it to create. – vc_dim – 2017-12-27T13:39:02.810
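For reference, a minimal sketch of that check, assuming features and target are the encoded training columns from the question, K = 15 as reported there, and cv=5 folds:

from sklearn.model_selection import cross_validate

# 5-fold cross-validation; the mean test score is a more robust estimate
# than the accuracy obtained from a single train/test split
clf = KNeighborsClassifier(n_neighbors=15)
cv_results = cross_validate(clf, features, target, cv=5)
print("Mean CV accuracy: {:.4f}".format(cv_results['test_score'].mean()))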