scikit-learn RandomForestClassifier always hits 100% test accuracy

I have been playing with a toy problem to compare the performance and behavior of several scikit-learn classifiers.

Briefly, I have one continuous variable X (containing two samples of sizes n1 and n2, each drawn from a distinct normal distribution) and a corresponding label y (either 0 or 1).

X is built as follows:

# Subpopulation 1
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)

# Subpopulation 2
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)

# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))

n1, n2: number of data points in each subpopulation; mu1, sigma1, mu2, sigma2: mean and standard deviation of the population from which each sample is drawn.

I then split X and y into training and test set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)

And then I fit a series of models, for instance:

from sklearn import svm
clf = svm.SVC()

# Fit
clf.fit(X_train, y_train)

or, alternatively (full list in the table at the end):

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# Fit
rfc.fit(X_train, y_train)

For all models, I then calculate the accuracy on the training and the test sets. For this I implemented the following function:

def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
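
As a cross-check, the hand-rolled accuracy above should agree with scikit-learn's built-in helpers. A minimal sketch on illustrative toy data (the names X_demo, y_demo and the seed are assumptions, not part of the original experiment):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data (assumption): one feature, two balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(7.0, 3.0, 200).reshape(-1, 1)
y_demo = np.repeat([0.0, 1.0], 100)

# Any fitted classifier works here; a decision tree is used as an example
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Same formula as in apply_model_and_calc_accuracies
manual = 100 * sum(y_demo == y_hat) / y_demo.shape[0]
# Built-in equivalents, scaled from a fraction to a percentage
builtin = 100 * accuracy_score(y_demo, y_hat)
via_score = 100 * model.score(X_demo, y_demo)

print(manual, builtin, via_score)
```

All three values should be identical, so the custom function can be swapped for `accuracy_score` or `model.score` without changing the results.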

I compare the algorithms by changing n1, n2, mu1, sigma1, mu2, sigma2 and checking the training and test accuracies. I initialize the classifiers with their default parameters.

To make a long story short, the Random Forest Classifier always scores 100% accuracy on the test set, no matter what parameters I set.

If, for instance, I test the following parameters:

n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0

I merge two completely overlapping subpopulations into X (they still have the correct label y associated to them). My expectation for this experiment is that the various classifiers should be completely guessing, and I would expect a test accuracy of around 50%.
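
The 50% expectation can be made precise with Bayes' rule. A short sketch, assuming equal class priors (n1 = n2) and writing p(x) for the normal density shared by both labels:

$$
P(y=1 \mid x) = \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=0)\,P(y=0) + p(x \mid y=1)\,P(y=1)} = \frac{\tfrac{1}{2}\,p(x)}{\tfrac{1}{2}\,p(x) + \tfrac{1}{2}\,p(x)} = \frac{1}{2}
$$

Since the posterior is 1/2 for every x, any decision rule has an expected accuracy of 50% on unseen data drawn this way.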

In reality, this is what I get:

| Algorithm                   | Train Accuracy % | Test Accuracy % |
|-----------------------------|------------------|-----------------|
| Support Vector Machines     |  56.3            |  42.4           |
| Logistic Regression         |  49.1            |  52.8           |
| Stochastic Gradient Descent |  50.1            |  50.4           |
| Gaussian Naive Bayes        |  50.1            |  52.8           |
| Decision Tree               | 100.0            |  51.2           |
| Random Forest               | 100.0            | *100.0*         |
| Multi-Layer Perceptron      |  50.1            |  49.6           |

I don't understand how this is possible. The Random Forest classifier never sees the test set during training, and yet it still classifies it with 100% accuracy.

Thanks for any input!

Upon request, I paste my code here (with only two of the originally tested classifiers and less verbose outputs).

import numpy as np
import sklearn
import matplotlib.pyplot as plt

# Seed
np.random.seed(42)

# Subpopulation 1
n1 = 250
mu1 = 7.0
sigma1 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)

# Subpopulation 2
n2 = 250
mu2 = 7.0
sigma2 = 3.0
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)

# Display the data
plt.plot(s1, np.zeros(n1), 'r.')
plt.plot(s2, np.ones(n2), 'b.')

# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))

# Split in training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(f"Train set contains {X_train.shape[0]} elements; test set contains {X_test.shape[0]} elements.")

# Display the test data
X_test_0 = X_test[y_test == 0]
X_test_1 = X_test[y_test == 1]
plt.plot(X_test_0, np.zeros(X_test_0.shape[0]), 'r.')
plt.plot(X_test_1, np.ones(X_test_1.shape[0]), 'b.')

# Define a convenience function
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test

# Classify

# Use Decision Tree
from sklearn import tree
dtc = tree.DecisionTreeClassifier()

# Fit
dtc.fit(X_train, y_train)

# Calculate accuracy on training and test set
a_train_dtc, a_test_dtc = apply_model_and_calc_accuracies(dtc)

# Report
print(f"Training accuracy = {a_train_dtc}%; test accuracy = {a_test_dtc}%")

# Use Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# Fit
rfc.fit(X, y)

# Calculate accuracy on training and test set
a_train_rfc, a_test_rfc = apply_model_and_calc_accuracies(rfc)

# Report
print(f"Training accuracy = {a_train_rfc}%; test accuracy = {a_test_rfc}%")

Aaron Ponti

Posted 2020-04-06T09:49:52.837

Reputation: 43

I have a couple of suggestions which might help debug your problem. 1) Train a random forest with a low number of estimators, as that should essentially make it a decision tree, and see what happens then. 2) You generated overlapping data, but try to create identical data that have both classes. – Valentin Calomme – 2020-04-06T10:19:14.013

Following your first suggestion, I went from 100 estimators (the default) down to 10, and indeed the test accuracy went down to 96%. With 1 estimator it goes even lower, to 86.1%. So, the training (and testing) procedure seems to be correct. I am not completely sure I understood your second point, however. – Aaron Ponti – 2020-04-06T10:41:18.100

You use the same parameters to generate your data, but you don't necessarily generate the exact same data. What I mean is: create one dataset, label it with 0, then make a copy of it but label it with 1. That way, your model must guess. – Valentin Calomme – 2020-04-06T11:08:23.840

Indeed, with two copies of the same sample, once labeled with 0 and once with 1, the Random Forest classifier reaches a test accuracy of 43.2%. So everything seems to behave correctly. Now I just need to wrap my head around the idea that the Random Forest classifier can correctly label test examples from two distinct sets coming from the exact same distribution. – Aaron Ponti – 2020-04-06T12:51:32.650

Answers

4

rfc.fit(X, y) should be rfc.fit(X_train, y_train)

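You are simply memorizing the entire dataset with RandomForestClassifier: X contains the test points, so the forest has already seen every test sample together with its label.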
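
The effect can be reproduced in isolation. A minimal sketch, assuming the same setup as in the question (two identical normal distributions, 250 points each); the seeds and the names `leaky` and `clean` are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two fully overlapping subpopulations, labeled 0 and 1
rng = np.random.default_rng(42)
X = rng.normal(7.0, 3.0, 500).reshape(-1, 1)
y = np.concatenate((np.zeros(250), np.ones(250)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Leaky fit: the forest is trained on ALL data, including the test points
leaky = RandomForestClassifier(random_state=0).fit(X, y)
# Correct fit: the forest only ever sees the training split
clean = RandomForestClassifier(random_state=0).fit(X_train, y_train)

acc_leaky = leaky.score(X_test, y_test)
acc_clean = clean.score(X_test, y_test)
print(f"fit(X, y):             test accuracy = {100 * acc_leaky:.1f}%")
print(f"fit(X_train, y_train): test accuracy = {100 * acc_clean:.1f}%")
```

The leaky forest should score close to 100% on the test set because its fully grown trees have memorized those exact points, while the correctly fitted one should land near chance level.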

MrMulliner

Posted 2020-04-06T09:49:52.837

Reputation: 116

Sorry, everyone! – Aaron Ponti – 2020-04-08T10:12:54.320

3

I am debugging your code and I don't get those results. If I copy, paste and run your code, I get:

from sklearn.metrics import accuracy_score
accuracy_score(rfc.predict(X_test),y_test)

>>>0.488

y_test_hat = rfc.predict(X_test)
100 * sum(y_test == y_test_hat) / y_test.shape[0]
>>> 48.8

apply_model_and_calc_accuracies(rfc)
>>> (100.0, 48.8)

Could you share the exact lines you run to get those results? This is almost certainly a bug in the code, not a conceptual problem.

Carlos Mougan

Posted 2020-04-06T09:49:52.837

Reputation: 4 420

After fitting the model, I call my apply_model_and_calc_accuracies(rfc) with the fitted model RandomForestClassifier (rfc). – Aaron Ponti – 2020-04-06T13:33:42.830

@AaronPonti could you provide the full script? Right now it seems fine to me. – Carlos Mougan – 2020-04-06T16:07:43.767

I edited my original post to add a trimmed-down version of the code that shows the problem. – Aaron Ponti – 2020-04-07T17:42:55.367

As a result I get, for the DT, [Training accuracy = 100.0%; test accuracy = 44.0%] and, for the RF, [Training accuracy = 93.33333333333333%; test accuracy = 94.4%], which makes complete sense to me. – Carlos Mougan – 2020-04-07T18:11:06.927