Is my model overfitting or not?

4

1

I have 50,000 observations with a 70% positive / 30% negative target variable. I'm getting an accuracy of around 96-99%, which of course seems unreal, and I'm worried that my model is overfitting, though I don't understand why. I replaced all outliers with the 5th and 95th percentiles and standardized the data, yet it still shows this unreal accuracy. A bit of online searching suggested checking the difference between training and test accuracy, which I did. For Random Forest it came out as:

Training Accuracy: 0.997975
Test Accuracy: 0.9715


For logistic regression it shows

Training Accuracy: 0.967225
Test Accuracy: 0.9647


This is the code I used for running the model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression()
trained_model = clf.fit(X_train, y_train)  # fit() returns the fitted estimator, so one call is enough
predictions = trained_model.predict(X_test)

print("Training Accuracy:", accuracy_score(y_train, trained_model.predict(X_train)))
print("Test Accuracy:", accuracy_score(y_test, predictions))


I also tried k-fold cross-validation, which gave similar results:

from sklearn import model_selection
from sklearn.model_selection import StratifiedKFold

# shuffle=True is required for random_state to take effect in recent scikit-learn versions
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
model_skfold = LogisticRegression()
results_skfold = model_selection.cross_val_score(model_skfold, X, Y, cv=skfold)
print("Accuracy: %.2f%%" % (results_skfold.mean() * 100.0))


Lastly, I applied L1 regularization to check the results, and this is what I got:

C = [10, 1, 0.1, 0.001]  # regularization strengths reported in the results below

for c in C:
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    y_pred_log_reg = clf.predict(X_test)
    acc_log_reg = round(clf.score(X_train, y_train) * 100, 2)
    print(str(acc_log_reg) + ' percent')
    print('C:', c)
    print('Coefficient of each feature:', clf.coef_)
    # note: the model is fit on X_train above, but the two scores below
    # are computed on the standardized X_train_std / X_test_std
    print('Training accuracy:', clf.score(X_train_std, y_train))
    print('Test accuracy:', clf.score(X_test_std, y_test))
    print('')


The results:

96.72 percent
C: 10
Coefficient of each feature: [[-2.50e+00 -1.40e-03  2.65e+00  4.09e-02 -2.03e-03  2.75e-04  1.79e-02
-2.13e-03 -2.18e-03  2.90e-03  2.69e-03 -4.93e+00 -4.89e+00 -4.88e+00
-3.27e+00 -3.30e+00]]
Training accuracy: 0.5062
Test accuracy: 0.5027

96.72 percent
C: 1
Coefficient of each feature: [[-2.50e+00 -1.41e-03  2.66e+00  4.10e-02 -2.04e-03  2.39e-04  1.68e-02
-3.29e-03 -3.80e-03  2.52e-03  2.62e-03 -4.22e-02 -9.55e-03  0.00e+00
-1.73e+00 -1.77e+00]]
Training accuracy: 0.482525
Test accuracy: 0.4738

96.74 percent
C: 0.1
Coefficient of each feature: [[-2.46e+00 -1.38e-03  2.58e+00  4.03e-02 -1.99e-03  2.22e-04  1.44e-02
-4.49e-03 -5.13e-03  2.03e-03  2.20e-03  0.00e+00  0.00e+00  0.00e+00
0.00e+00 -6.54e-03]]
Training accuracy: 0.616675
Test accuracy: 0.6171

95.92 percent
C: 0.001
Coefficient of each feature: [[-1.43e+00 -6.82e-04  1.19e+00  2.73e-02 -1.10e-03  1.22e-04  0.00e+00
-2.74e-03 -2.55e-03  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
0.00e+00  0.00e+00]]
Training accuracy: 0.655075
Test accuracy: 0.6565


The code I used for standardization and for replacing outliers:

from sklearn import preprocessing

# fit the scaler on the training set only, then apply it to both splits
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std  = std_scale.transform(X_test)

# clip each column of X to its 5th/95th percentiles, in place
X.clip(lower=X.quantile(0.05), upper=X.quantile(0.95), axis=1, inplace=True)


Do let me know if any other information is required; any guidance will be appreciated.

1

Could possibly be data leakage: https://machinelearningmastery.com/data-leakage-machine-learning/

But I have scaled the test data using the training-set information as recommended, and even if I don't scale it, the problem persists. – hyeri – 2020-02-25T16:25:57.960

Let's not forget the possibility that the method and the results are correct. It's perfectly possible in some tasks/data to obtain very high performance; it depends on what you are doing: are you sure this performance is unrealistic for your data/task? To me the difference between training/testing performance is not that high, so I'm not sure it proves overfitting. – Erwan – 2020-02-25T18:04:13.587

To be honest, I have no idea; this is for my interview assessment, so I just have the data with no clue about what a realistic answer would be. I excluded two variables from this data and the accuracy dropped to 70%; could the reason be that those variables are highly correlated? I have included a heatmap as well. I will try forward or backward selection now. – hyeri – 2020-02-25T18:13:28.160

@hyeri if this is part of an interview assessment it might be a simple exercise meant to check whether the candidate can run a basic ML job. Without any other indication I'd say that your results are probably correct. – Erwan – 2020-02-25T19:47:40.003

Sometimes the leaks can be in the variables themselves. But that would be unlikely for an interview assessment. I think @Erwan is most likely correct. It is not uncommon to get a simple problem where they are mostly interested in how it is solved rather than the accuracy of the solution. – Simon Larsson – 2020-02-26T06:14:22.763

Have you tried with different random states? That might shake things up. – bstrain – 2020-03-02T13:30:39.283

1

I don't have enough reputation to leave a comment, but I'd like to support Derek O's assessment and add one more point: if your 50K observations are repeated measures (multiple rows coming from the same individual or unit), then you'll want to make sure your cross-validation setup keeps 100% of each individual's observations within the same fold.

An analogy: if you were modeling teeth for the probability of developing a cavity - a human head has 32 teeth, and if a particular person's teeth end up in both your training and testing folds, that would be a form of leakage, because the 32 teeth are not totally independent but are correlated with the other teeth in the same head. This particular form of leakage often slips people's minds.
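In scikit-learn this is what group-aware cross-validation is for. A minimal sketch, assuming a hypothetical person_id column that identifies which individual/unit each row came from:

from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

groups = df['person_id']        # hypothetical column: one id per individual/unit
gkf = GroupKFold(n_splits=10)   # every row sharing an id lands in the same fold

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, Y, cv=gkf, groups=groups)
print("Group-aware CV accuracy: %.2f%%" % (scores.mean() * 100))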

0

It's an imbalanced-classes problem; however, it's not a very imbalanced dataset. This is a common question/task in interviews. You may get high accuracy because the minority class carries less weight in the model.

This topic has been discussed several times here and here.
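As a rough check (a sketch reusing the train/test split and predictions from the question), you can compare against a majority-class baseline and look at class-aware metrics, so raw accuracy can't hide behind the 70/30 split:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report

# a classifier that always predicts the majority class scores ~70% on this split
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print("Majority-class accuracy:", baseline.score(X_test, y_test))

# per-class precision/recall and balanced accuracy are harder to fool
print("Balanced accuracy:", balanced_accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))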

0

If your results seem too good to be true, they probably are.

I think the most likely reason is data leakage. If you performed any standardization/normalization, or even imputation, using the entire data set without using a pipeline for the k-folds, you've biased your dataset: when you do k-fold cross-validation, each "training fold" was created using information that wasn't supposed to be known to the model (i.e. the "training fold" was standardized using information from the "testing fold"). This leads to biased models that basically already know the answer ahead of time, resulting in an inflated accuracy that won't generalize once you run the model on data it hasn't seen before.

The way to combat this is to perform standardization/imputation within each k-fold, or hold back a validation set and perform standardization/imputation on the training and validation sets separately.
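One way to do that (a minimal sketch, reusing the X, Y and fold setup from the question) is to wrap the scaler and the model in a scikit-learn Pipeline, so the scaler is re-fit on the training portion of every split:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# the scaler is fitted inside each training fold, never on the held-out fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
scores = cross_val_score(pipe, X, Y, cv=skfold)
print("Accuracy: %.2f%%" % (scores.mean() * 100.0))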

Jason Brownlee's article on data leakage is a good resource on understanding the issue in more depth.

I suppose the class imbalance in your dataset could also be the culprit, but I don't think a 70/30 class imbalance can explain a 96-99% accuracy.