
I have been playing with a toy problem to compare the performance and behavior of several scikit-learn classifiers.

In brief, I have one continuous variable X (which contains two samples of size N, each drawn from a distinct normal distribution) and a corresponding label y (either 0 or 1).

X is built as follows:

```
# Subpopulation 1
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
```

`n1`, `n2`: number of data points in each subpopulation; `mu1`, `sigma1`, `mu2`, `sigma2`: mean and standard deviation of the population from which each sample is drawn.

I then split `X` and `y` into training and test sets:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
```

And then I fit a series of models, for instance:

```
from sklearn import svm
clf = svm.SVC()
# Fit
clf.fit(X_train, y_train)
```

or, alternatively (full list in the table at the end):

```
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X_train, y_train)
```

For all models, I then calculate the accuracy on the training and the test sets. For this I implemented the following function:

```
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
```
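The same two numbers can also be computed with scikit-learn's `accuracy_score` instead of a hand-written sum. The following is a minimal sketch, not the code used for the results in this post: the name `calc_accuracies`, its explicit-argument signature, and the tiny demo arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def calc_accuracies(model, X_train, y_train, X_test, y_test):
    """Return (train, test) accuracy in percent for an already fitted model."""
    a_train = 100 * accuracy_score(y_train, model.predict(X_train))
    a_test = 100 * accuracy_score(y_test, model.predict(X_test))
    return a_train, a_test


# Tiny, perfectly separable demo data (hypothetical values, for illustration only)
X_demo_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo_train = np.array([0, 0, 0, 1, 1, 1])
X_demo_test = np.array([[0.5], [11.5]])
y_demo_test = np.array([0, 1])

demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo_train, y_demo_train)
demo_train_acc, demo_test_acc = calc_accuracies(
    demo_model, X_demo_train, y_demo_train, X_demo_test, y_demo_test
)
```

Passing the arrays as arguments, rather than reading them from globals, makes it explicit which data each model is evaluated on.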

I compare the algorithms by changing `n1`, `n2`, `mu1`, `sigma1`, `mu2`, `sigma2` and checking the accuracies on the training and test sets. I initialize the classifiers with their default parameters.

To make a long story short, the Random Forest Classifier always scores 100% accuracy on the test set, no matter what parameters I set.

If, for instance, I test the following parameters:

```
n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0
```

I merge two completely overlapping subpopulations into `X` (they still have the correct label `y` associated with them). My expectation for this experiment is that the various classifiers should be purely guessing, so I would expect a test accuracy of around 50%.
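The 50% expectation can be checked directly from the densities: when both subpopulations share the same mean, standard deviation, and size, the probability that a point carries label 1 is 0.5 at every value of x, so no decision rule can beat guessing on unseen data. A small sketch of that calculation follows; the helper names `normal_pdf` and `posterior_class1` and the grid are illustrative assumptions, not part of the code in this post.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def posterior_class1(x, mu1, sigma1, mu2, sigma2, n1, n2):
    """P(y = 1 | x) when class sizes n1 and n2 act as the class priors."""
    w0 = n1 * normal_pdf(x, mu1, sigma1)  # label 0 is subpopulation 1
    w1 = n2 * normal_pdf(x, mu2, sigma2)  # label 1 is subpopulation 2
    return w1 / (w0 + w1)


# Evaluate on a grid covering roughly mu +/- 4 sigma for the parameters above
x_grid = np.linspace(-5.0, 19.0, 200)
p_same = posterior_class1(x_grid, 7.0, 3.0, 7.0, 3.0, 250, 250)
print(p_same.min(), p_same.max())
```

Changing `mu2` or `sigma2` in the call makes the posterior depart from 0.5, which is when a classifier can start to do better than chance.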

In reality, this is what I get:

| Algorithm                   | Train Accuracy % | Test Accuracy % |
|-----------------------------|------------------|-----------------|
| Support Vector Machines     | 56.3             | 42.4            |
| Logistic Regression         | 49.1             | 52.8            |
| Stochastic Gradient Descent | 50.1             | 50.4            |
| Gaussian Naive Bayes        | 50.1             | 52.8            |
| Decision Tree               | 100.0            | 51.2            |
| Random Forest               | 100.0            | *100.0*         |
| Multi-Layer Perceptron      | 50.1             | 49.6            |

I don't understand how this is possible. The Random Forest classifier never sees the test set during training, and it still classifies it with 100% accuracy.

Thanks for any input!

Upon request, I paste my code here (with only two of the originally tested classifiers and less verbose outputs).

```
import numpy as np
import sklearn
import matplotlib.pyplot as plt
# Seed
np.random.seed(42)
# Subpopulation 1
n1 = 250
mu1 = 7.0
sigma1 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
n2 = 250
mu2 = 7.0
sigma2 = 3.0
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Display the data
plt.plot(s1, np.zeros(n1), 'r.')
plt.plot(s2, np.ones(n2), 'b.')
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
# Split in training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(f"Train set contains {X_train.shape[0]} elements; test set contains {X_test.shape[0]} elements.")
# Display the test data
X_test_0 = X_test[y_test == 0]
X_test_1 = X_test[y_test == 1]
plt.plot(X_test_0, np.zeros(X_test_0.shape[0]), 'r.')
plt.plot(X_test_1, np.ones(X_test_1.shape[0]), 'b.')
# Define a commodity function
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
# Classify
# Use Decision Tree
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
# Fit
dtc.fit(X_train, y_train)
# Calculate accuracy on training and test set
a_train_dtc, a_test_dtc = apply_model_and_calc_accuracies(dtc)
# Report
print(f"Training accuracy = {a_train_dtc}%; test accuracy = {a_test_dtc}%")
# Use Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X, y)
# Calculate accuracy on training and test set
a_train_rfc, a_test_rfc = apply_model_and_calc_accuracies(rfc)
# Report
print(f"Training accuracy = {a_train_rfc}%; test accuracy = {a_test_rfc}%")
```
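One detail in the listing above is worth checking: the Decision Tree is fitted with `dtc.fit(X_train, y_train)`, whereas the Random Forest is fitted with `rfc.fit(X, y)`, i.e. on all rows including those that later form the test set. If that call is what produced the table, the forest has already seen every test point with its label, which would be consistent with a test accuracy close to 100%. The sketch below compares the two ways of fitting on the same data; the `random_state=0` arguments and the variable names `rfc_all`, `rfc_train`, `acc_all`, and `acc_train_only` are assumptions added for reproducibility and are not in the original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Same setup as above: two fully overlapping subpopulations
n1 = n2 = 250
s1 = np.random.normal(7.0, 3.0, n1)
s2 = np.random.normal(7.0, 3.0, n2)
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((np.zeros(n1), np.ones(n2)))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Variant A: fitted on all rows, so the test rows were seen during training
rfc_all = RandomForestClassifier(random_state=0).fit(X, y)
acc_all = rfc_all.score(X_test, y_test)

# Variant B: fitted on the training split only
rfc_train = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc_train_only = rfc_train.score(X_test, y_test)

print(f"Fitted on X, y:             test accuracy = {100 * acc_all:.1f}%")
print(f"Fitted on X_train, y_train: test accuracy = {100 * acc_train_only:.1f}%")
```

With fully grown trees, a forest can largely memorize the labels of points it was trained on, so only variant B gives an estimate of performance on unseen data.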

I have a couple of suggestions which might help debug your problem. 1) train a random forest with a low number of estimators, as it should essentially make it a decision tree, and see what happens then 2) you generated overlapping data, but try to create identical data that have both classes – Valentin Calomme – 2020-04-06T10:19:14.013

Following your first suggestion, I went from 100 estimators (the default) down to 10, and indeed the test accuracy went down to 96%. With 1 estimator it goes even lower, to 86.1%. So, the training (and testing) procedure seems to be correct. I am not completely sure I understood your second point, however. – Aaron Ponti – 2020-04-06T10:41:18.100

You use the same parameters to generate your data, but you don't necessarily generate the exact same data. What I mean is create one dataset, label it with 0, then make a copy of it but label it with 1. That way, your model must guess – Valentin Calomme – 2020-04-06T11:08:23.840

Indeed, with two copies of the same sample once labeled with 0 and once with 1, the Random Forest classifier reaches a test accuracy of 43.2%. So everything seems to behave correctly. Now I just need to wrap my head around the idea that the Random Forest classifier can correctly label test examples from two distinct sets coming from the exact same distribution. – Aaron Ponti – 2020-04-06T12:51:32.650
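The duplicated-sample experiment suggested in the comments can be sketched as follows. This is an illustrative reconstruction, not the code that produced the 43.2% figure: the seed, the `random_state=0` arguments, and the names `X_dup`, `y_dup`, `rf_dup`, and `dup_acc` are assumptions, and the forest here is fitted on the training split only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# One sample, used twice: once labeled 0 and once labeled 1
s = rng.normal(7.0, 3.0, 250)
X_dup = np.concatenate((s, s)).reshape(-1, 1)
y_dup = np.concatenate((np.zeros(s.size), np.ones(s.size)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dup, y_dup, stratify=y_dup, test_size=0.25, random_state=0
)

# Fit on the training split only and score on the held-out split
rf_dup = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
dup_acc = rf_dup.score(X_te, y_te)
print(f"Test accuracy on duplicated data: {100 * dup_acc:.1f}%")
```

Note that with a random split, the twin of a test point (same value, opposite label) may end up in the training set, so the accuracy in this setup need not be exactly 50%.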