Train Accuracy vs Test Accuracy vs Confusion matrix

15

14

After I developed my predictive model using Random Forest I get the following metrics:

        Train Accuracy ::  0.9764634601043997
        Test Accuracy  ::  0.7933284397683713
         Confusion matrix  [[28292  1474]
                            [ 6128   889]]

These are the results of this code:

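    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    # Hold out 30% of the rows for testing.
    # 'bad_loans' is assumed to be the label column.
    training_features, test_features, training_target, test_target = train_test_split(
        df.drop(['bad_loans'], axis=1),
        df['bad_loans'],
        test_size=.3,
        random_state=12)

    # Fit the forest once on the training split only
    clf = RandomForestClassifier()
    trained_model = clf.fit(training_features, training_target)
    predictions = trained_model.predict(test_features)

    print("Train Accuracy :: ", accuracy_score(training_target, trained_model.predict(training_features)))
    print("Test Accuracy  :: ", accuracy_score(test_target, predictions))
    print("Confusion matrix ", confusion_matrix(test_target, predictions))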

However, I'm a little confused about how to interpret and explain these values.

What exactly do these 3 measures tell me about my model?

Thanks!

Pedro Alves

Posted 2018-02-28T21:07:32.770

Reputation: 347

Just to be clear, your confusion matrix here is based on the test data, as is usual when one is reported. You could also compute one for the training data that you built the model on. – TwinPenguins – 2018-02-28T23:09:52.770

I have some doubts about how to calculate these measures. Why does Train Accuracy use (training_target, trained_model.predict(training_features)) and not (training_target, trained_model.predict(test_target))? – Pedro Alves – 2018-02-28T23:26:47.337

The accuracy just for Class 1 is 77/94 ? – Pravin – 2019-01-03T10:53:33.300

Answers

23

Definitions

  • Accuracy: The number of correct classifications / the total number of classifications.
  • Train accuracy: The accuracy of a model on the examples it was constructed on.
  • Test accuracy: The accuracy of a model on examples it hasn't seen.
  • Confusion matrix: A tabulation of the predicted class against the actual class. Which of the two goes in the rows depends on the tool that produced it.
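
As a small illustration of these definitions, accuracy can be recomputed from the confusion matrix itself, because the correct classifications sit on its diagonal. This is only a minimal sketch in plain Python using the test-set matrix from the question; the variable names are made up for the example.

```python
# Test-set confusion matrix reported in the question
confusion = [[28292, 1474],
             [6128, 889]]

# Correct classifications are the diagonal entries
correct = sum(confusion[i][i] for i in range(len(confusion)))
# Every test example lands in exactly one cell
total = sum(sum(row) for row in confusion)

test_accuracy = correct / total
print(f"{correct} / {total} = {test_accuracy:.4f}")
```

The result agrees with the reported test accuracy of about 0.7933, which confirms that the matrix was computed on the test set.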

Overfitting

What I would make of your results is that your model is overfitting. You can tell from the large difference between the train accuracy (about 0.98) and the test accuracy (about 0.79). Overfitting means that the model learned rules specific to the train set, and those rules do not generalize well beyond it.

Your confusion matrix tells us how much it is overfitting. scikit-learn's confusion_matrix(test_target, predictions) puts the actual classes in the rows and the predicted classes in the columns, so your largest class makes up about 81% of the test set (29766 of 36783 examples). Assuming that your test and train sets have a similar distribution, any useful model would have to score more than 81% accuracy, because a simple 0R-model (one that always predicts the largest class) already would. Your model scores just under 80% on the test set.

In depth look at the confusion matrix

If you look at the confusion matrix relatively (in percentages of the test set), it looks like this:

               Predicted    TOT
               1     2
Actual 1    | 77% | 4% | 81%
Actual 2    | 17% | 2% | 19%
TOT         | 94% | 6% |

You can infer from the totals that your model predicts Class 1 94% of the time, while the actual occurrence of Class 1 is 81%. Hence your model is overestimating this class. It could be the case that it learned specific (complex) rules on the train set that work against you in the test set.

It is also worth noting that the false negatives of Class 2 (17%-point, row 2, column 1) hurt your overall performance most, and they are also far more common than the false negatives of Class 1 (4%-point, row 1, column 2) with respect to the total population of the respective classes (19% and 81%). This means that your model is good at predicting Class 1 but very bad at predicting Class 2. The accuracy just for Class 1 is 28292/29766 (about 95%), while the accuracy for Class 2 is 889/7017 (about 13%).
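
The percentages and per-class figures can be recomputed directly from the counts. The sketch below is plain Python and assumes the layout that scikit-learn's confusion_matrix(y_true, y_pred) uses, where rows are the actual classes and columns are the predicted classes. The variable names are made up for this illustration. It also computes the accuracy that a ZeroR model (always predicting the largest actual class) would reach on the same test set.

```python
# Test-set confusion matrix from the question.
# Assumed layout (scikit-learn): rows = actual class, columns = predicted class.
confusion = [[28292, 1474],
             [6128, 889]]

total = sum(sum(row) for row in confusion)
n_classes = len(confusion)

actual_totals = [sum(row) for row in confusion]            # row sums
predicted_totals = [sum(col) for col in zip(*confusion)]   # column sums

# Share of the test set that actually is / is predicted as each class
actual_share = [n / total for n in actual_totals]
predicted_share = [n / total for n in predicted_totals]

# Recall: fraction of each actual class that was predicted correctly
recall = [confusion[i][i] / actual_totals[i] for i in range(n_classes)]
# Precision: fraction of each predicted class that was correct
precision = [confusion[i][i] / predicted_totals[i] for i in range(n_classes)]

# ZeroR baseline: always predict the largest actual class
zero_r_accuracy = max(actual_totals) / total
model_accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

for i in range(n_classes):
    print(f"Class {i + 1}: actual {actual_share[i]:.1%}, "
          f"predicted {predicted_share[i]:.1%}, "
          f"recall {recall[i]:.1%}, precision {precision[i]:.1%}")
print(f"ZeroR accuracy {zero_r_accuracy:.1%} vs model accuracy {model_accuracy:.1%}")
```

Under that layout assumption, the ZeroR baseline comes out slightly above the model's test accuracy, and the recall for the smaller class is far lower than for the larger one.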

S van Balen

Posted 2018-02-28T21:07:32.770

Reputation: 1 294

Voted up for good answer. Maybe for educational purposes it would be better if you could elaborate on "how much it is overfitting" based on the actual confusion matrix elements. I am also curious to learn more. – TwinPenguins – 2018-02-28T23:05:41.397

I added a more in depth look, let me know if this is what you are looking for. – S van Balen – 2018-03-01T07:57:29.257

So, for example, when I get this confusion matrix: Train Accuracy :: 0.8147735305312381 Test Accuracy :: 0.8086616099828725 Confusion matrix [[9870 16] [2330 45]]

It says that my model only has a precision of 73% – Pedro Alves – 2018-03-01T14:28:48.073

That confusion matrix would correspond to your test accuracy. (9870 + 45)/(9870+2330+16+45) = 0.80866161 – S van Balen – 2018-03-01T19:30:08.520

@SvanBalen Nice pedagogic answer. What is a "0R-model"? Does it go by any other names? – val – 2020-09-30T17:54:17.007

ZeroR is a lot easier to Google :) – S van Balen – 2020-10-05T07:42:13.840