I got 100% accuracy on my test set. Is there something wrong?

8

3

I got 100% accuracy on my test set when training with a decision tree, but only 85% accuracy with a random forest.

Is there something wrong with my model, or is a decision tree simply best suited to the dataset provided?

Code:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

#Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)

#Decision Tree

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)

Confusion Matrix:

Random Forest:

[[19937  1]
 [    8 52]]

Decision Tree:

[[19938  0]
 [    0 60]]

Harigovind Valsakumar

Posted 2018-07-19T08:16:21.663

Reputation: 83

Can we see the data? What is your train score? – Sam – 2018-07-19T10:55:56.530

I can't show the data as it is private. I am doing transaction fraud detection; the data is of actual transactions, but about 99% of it belongs to one class. – Harigovind Valsakumar – 2018-07-20T04:08:00.520

Ah, OK. What metric are you using to evaluate model performance? Given your class imbalance, I'd recommend using ROC AUC. If you cannot provide the data, could you show us a confusion matrix of your results? – Sam – 2018-07-20T05:19:07.340

confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]] – Harigovind Valsakumar – 2018-07-20T05:31:47.263

Hm. Can you show us some code? – Sam – 2018-07-20T05:36:23.633

I added the code in my question. I used the default value for parameters. – Harigovind Valsakumar – 2018-07-20T05:44:19.477

roc value for random forest : 0.8982800082624888 roc value for decision tree : 0.9915254237288136 – Harigovind Valsakumar – 2018-07-20T06:00:55.020

Have you looked at the decision tree vs random forest tree to see where the discrepancies of splits are? There will be some element which is different in the decision tree than that of the RF model. Let me know – Sam – 2018-07-20T07:15:06.200

Is it just me, or is your Random Forest accuracy more like 99.95%, not 85%, based on your confusion matrix? What am I missing here? – Elegant Code – 2020-01-09T16:34:01.950
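The arithmetic behind that last comment can be checked directly from the two confusion matrices in the question. This is only a sketch: it assumes scikit-learn's usual layout (rows are true classes, columns are predicted classes) and that the rare class is the second row and column. The helper name `summarize` is made up for illustration.

```python
# Confusion matrices copied from the question.
# Assumed layout: rows = true class, columns = predicted class,
# with the rare class in the second row/column.
random_forest = [[19937, 1], [8, 52]]
decision_tree = [[19938, 0], [0, 60]]


def summarize(cm):
    """Derive a few summary metrics from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),        # share of rare-class rows that were caught
        "precision": tp / (tp + fp),     # share of rare-class predictions that were right
        "majority_baseline": (tn + fp) / total,  # accuracy of always predicting the common class
    }


rf_metrics = summarize(random_forest)
dt_metrics = summarize(decision_tree)
for name, metrics in [("Random Forest", rf_metrics), ("Decision Tree", dt_metrics)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under those assumptions the random forest's accuracy is 19989/19998, roughly 99.95%, while its recall on the rare class is 52/60, roughly 86.7%. Always predicting the common class would already give about 99.7% accuracy on this split, which is why accuracy alone says little here.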

Answers

5

There may be a few reasons this is happening.

  1. First of all, check your code. 100% accuracy seems unlikely in any setting. How many testing data points do you have? How many training data points did you train your model on? You may have made a coding mistake and compared a list with itself.

  2. Did you use a different test set for testing? The high accuracy may be due to luck. Try k-fold cross-validation, for which libraries are widely available.

  3. You can visualise your decision tree to find out what is happening. If it has 100% accuracy on the test set, does it also have 100% on the training set?
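The checks in points 2 and 3 could look roughly like the sketch below. The original data is private, so this uses a synthetic imbalanced dataset from `make_classification` as a stand-in (an assumption, as are the sample size, class weights, and number of folds). It reports the training score next to the test score and then a stratified 5-fold ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the private data (assumption): about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)

# Several different train/test cuts instead of a single, possibly lucky, one.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="roc_auc"
)
print("ROC AUC per fold:", cv_scores)
print("mean ROC AUC:", cv_scores.mean())
```

If the per-fold scores vary widely, a single perfect test score is more likely to be a lucky split than a property of the model.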

c zl

Posted 2018-07-19T08:16:21.663

Reputation: 116

confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]] – Harigovind Valsakumar – 2018-07-20T04:03:59.773

Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data? – c zl – 2018-07-20T05:49:00.337

For test data. About 80k rows in train and 20k rows in test. – Harigovind Valsakumar – 2018-07-20T05:59:14.680

You are not answering my questions. – c zl – 2018-07-20T06:04:27.407

I used the training data for training. Training score: 0.9999499949995; 100% accuracy on the test data. – Harigovind Valsakumar – 2018-07-20T06:11:39.747

cross_val_score mean == 0.71; 99% of the data belongs to one class. – Harigovind Valsakumar – 2018-07-20T06:20:53.030

Hey, looking at your code, you are overfitting your Random Forest Classifier. Please look at the scikit-learn documentation for Random Forest Classifier. Perhaps you need to adjust the max depth to be more shallow. (By default, max_depth is None, so each tree is grown fully.) I assume you have many features. – c zl – 2018-07-20T07:30:09.427

4

The default hyper-parameters of the DecisionTreeClassifier allow it to overfit your training data.

The default min_samples_leaf is 1. The default max_depth is None. This combination allows your DecisionTreeClassifier to grow until there is a single data point at each leaf.
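As a rough illustration of that effect, the sketch below fits one tree with the defaults and one with constraints, then compares their size and scores. The synthetic data from `make_classification` stands in for the private dataset, and the specific values `max_depth=4` and `min_samples_leaf=20` are arbitrary assumptions chosen only to make the contrast visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], flip_y=0.05, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Defaults: max_depth=None, min_samples_leaf=1, so the tree grows until leaves are pure.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: the limits below are illustrative, not tuned.
small_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=20, random_state=0
).fit(X_train, y_train)

for name, model in [("default", default_tree), ("constrained", small_tree)]:
    print(
        name,
        "| depth:", model.get_depth(),
        "| leaves:", model.get_n_leaves(),
        "| train acc:", model.score(X_train, y_train),
        "| test acc:", model.score(X_test, y_test),
    )
```

On data like this, the default tree typically memorizes the training split, including its noisy labels, which is the behaviour described above.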

Since you are getting $100\%$ accuracy, I would assume you have duplicates in your train and test splits. This has nothing to do with the way you split but with the way you cleaned your data.

Can you check if you have duplicate datapoints?

x = [[1, 2, 3],
     [4, 5, 6],
     [1, 2, 3]]

y = [1,
     2,
     1]

initial_number_of_data_points = len(x)


def get_unique(X_matrix, y_vector):
    Xy = list(set(list(zip([tuple(x) for x in X_matrix], y_vector))))
    X_matrix = [list(l[0]) for l in Xy]
    y_vector = [l[1] for l in Xy]
    return X_matrix, y_vector


x, y = get_unique(x, y)
data_points_removed = initial_number_of_data_points - len(x)
print("Number of duplicates removed:", data_points_removed )

If the same data points appear in both your train and test splits, high accuracy is expected.
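A complementary check, sketched here in plain Python, is to count how many test rows also occur in the training split after splitting. The function name and the toy data are made up for illustration; rows are compared together with their labels, in the same way as in `get_unique` above.

```python
def count_test_rows_seen_in_training(x_train, y_train, x_test, y_test):
    """Count test (row, label) pairs that also appear in the training split."""
    seen = {(tuple(row), label) for row, label in zip(x_train, y_train)}
    return sum((tuple(row), label) in seen for row, label in zip(x_test, y_test))


# Toy data for illustration only.
x_train = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y_train = [1, 2, 1]
x_test = [[1, 2, 3], [0, 0, 0]]
y_test = [1, 2]

overlap = count_test_rows_seen_in_training(x_train, y_train, x_test, y_test)
print("Test rows that also appear in training:", overlap, "of", len(x_test))
```

A large overlap means the test score mostly measures memorization rather than generalization.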

Bruno Lubascher

Posted 2018-07-19T08:16:21.663

Reputation: 2 833

2

I believe the problem you are facing is a class imbalance problem. 99% of your data belongs to one class, and it may be that your test data contains only that class. Because of this, there is a high probability that your model will predict that class for all of your test data. To deal with imbalanced data, you should use AUROC instead of accuracy. You can also use techniques such as oversampling and undersampling to balance the dataset.
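To illustrate why accuracy can mislead on imbalanced data, here is a small pure-Python sketch. The labels and scores are made up, and `auroc` is a hypothetical helper based on the pairwise-ranking definition of AUROC (the chance that a random positive is scored above a random negative), not a library function.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy data: 8 negatives followed by 2 positives.
y_true = [0] * 8 + [1] * 2

# A "model" that always predicts the majority class.
majority_predictions = [0] * 10
majority_scores = [0.0] * 10
accuracy = sum(p == t for p, t in zip(majority_predictions, y_true)) / len(y_true)

# A model whose scores carry some information about the rare class.
informative_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.35]

print("majority-class accuracy:", accuracy)
print("majority-class AUROC:", auroc(y_true, majority_scores))
print("informative-score AUROC:", auroc(y_true, informative_scores))
```

The always-majority model looks good on accuracy but its AUROC is no better than chance, which is the reason for preferring AUROC here.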

Vikram

Posted 2018-07-19T08:16:21.663

Reputation: 21

1

Please check if you used your test set for building the model. This is a common scenario, like:

Random Forest Classifier gives very high accuracy on test set - overfitting?

If that is the case, everything makes sense: Random Forest tries not to overfit, while a decision tree would just memorize your data as a tree.

SmallChess

Posted 2018-07-19T08:16:21.663

Reputation: 3 050

1

I agree with c zl. In my experience, this doesn't sound like a stable model and points to a lucky random cut of the data, something that will struggle to provide similar performance on unseen data.

The best models have:

  1. high accuracy on train data
  2. equally high accuracy on test data
  3. train and test accuracy within 5 to 10% of each other, which probably indicates model stability. The smaller the difference the better, I feel.

Bootstrapping and k-fold cross-validation should usually provide more reliable performance numbers.

Dan9ie

Posted 2018-07-19T08:16:21.663

Reputation: 21

1

  1. As already mentioned here, DTs are easy to overfit with the default parameters, so RFs are usually the better choice compared to DTs. Consider them more generalized.
  2. Why are the accuracies different? RF always uses a random subset of the features for each tree, but DT uses all of them. So you probably have a small number of features that have a very large influence on the target variable across the whole dataset. Why is that? Identify which features they are and investigate them closely.
  3. Now, can we say that DT is more stable for this particular task than RF? I'd say no, because the situation may change. If your DT relies on 1-3 important features, you cannot be sure that these features will play the same significant role in the future.
  4. You can use DT, but keep in mind that you have to retrain the model regularly, and compare the results to RF.

So, my advice:

  • compute feature importances (you can get them from the RF model) and use only the important features;
  • use some kind of KFold or StratifiedKFold here - it will give you more reliable scores;
  • since you have an imbalanced dataset, set class_weight - it should improve the RF score;
  • use GridSearchCV - it will also help you avoid overfitting (and 1000 trees is a lot, which may contribute to overfitting too);
  • always compute the train score as well;
  • investigate the important features.
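Several of these suggestions can be combined in one short sketch. It is not tuned for the original problem: the data is a synthetic stand-in from `make_classification`, and the parameter grid, fold count, and scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in data (assumption): about 5% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, weights=[0.95, 0.05], random_state=0
)

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    # Small illustrative grid; real searches would cover more values.
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6, None]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best mean ROC AUC:", search.best_score_)

# Rank features by the importance assigned by the refitted best forest.
importances = search.best_estimator_.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```

If one or two features dominate the ranking on the real data, those are the ones worth investigating first.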

avchauzov

Posted 2018-07-19T08:16:21.663

Reputation: 141

1

I had a similar issue, but I realized that I had included my target variable while predicting the test outcomes.

error:

predict(object = model_nb, test[,])

without the error:

predict(object = model_nb, test[,-16])

where the 16th column was for the dependent variable.
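The same kind of leak can be looked for in Python. The sketch below is plain Python with made-up toy data and a hypothetical helper name; it flags any feature column that is identical to the target and drops it. It only catches exact copies of the target, not features that are merely derived from it.

```python
def columns_identical_to_target(rows, target):
    """Return indices of feature columns whose values equal the target in every row."""
    n_columns = len(rows[0])
    return [
        j for j in range(n_columns)
        if all(row[j] == t for row, t in zip(rows, target))
    ]


# Toy data for illustration: the last column is a copy of the target.
rows = [
    [0.5, 1, 0],
    [1.2, 0, 1],
    [0.7, 1, 1],
    [3.1, 0, 0],
]
target = [0, 1, 1, 0]

leaky = columns_identical_to_target(rows, target)
print("columns identical to the target:", leaky)

# Drop the leaking columns before training or predicting.
safe_rows = [[value for j, value in enumerate(row) if j not in leaky] for row in rows]
print(safe_rows)
```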

Jeremiah Osibe

Posted 2018-07-19T08:16:21.663

Reputation: 11