
I have been playing with a toy problem to compare the performance and behavior of several scikit-learn classifiers.

In brief, I have one continuous variable X (containing two samples, of sizes n1 and n2, each drawn from a distinct normal distribution) and a corresponding label y (either 0 or 1).

X is built as follows:

# Subpopulation 1
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)

# Subpopulation 2
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)

# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))

Here n1 and n2 are the number of data points in each subpopulation; mu1, sigma1 and mu2, sigma2 are the mean and standard deviation of each population from which the corresponding sample is drawn.

I then split X and y into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)

And then I fit a series of models, for instance:

from sklearn import svm

clf = svm.SVC()

# Fit
clf.fit(X_train, y_train)

or, alternatively (full list in the table at the end):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

# Fit
rfc.fit(X_train, y_train)

For all models, I then calculate the accuracy on the training and test sets with the following function:

def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on the training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on the test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return both accuracies
    return a_train, a_test
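(Equivalently, scikit-learn's built-in accuracy_score gives the same numbers; the variant below is just a sketch, assuming the same X_train, X_test, y_train, y_test as above, and the function name is made up for illustration.)

from sklearn.metrics import accuracy_score

def apply_model_and_calc_accuracies_sk(model):
    # accuracy_score returns a fraction in [0, 1]; multiply by 100 for percent
    a_train = 100 * accuracy_score(y_train, model.predict(X_train))
    a_test = 100 * accuracy_score(y_test, model.predict(X_test))
    return a_train, a_test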

I compare the algorithms by varying n1, n2, mu1, sigma1, mu2, sigma2 and checking the accuracies on the training and test sets. All classifiers are initialized with their default parameters.
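The comparison loop looks roughly like this (a sketch of the intended procedure, not the exact code I ran; the full list of classifiers is in the table below):

from sklearn import svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

models = {
    "Support Vector Machines": svm.SVC(),
    "Logistic Regression": LogisticRegression(),
    "Stochastic Gradient Descent": SGDClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Multi-Layer Perceptron": MLPClassifier(),
}

for name, model in models.items():
    # Fit on the training split only, then score both splits
    model.fit(X_train, y_train)
    a_train, a_test = apply_model_and_calc_accuracies(model)
    print(f"{name}: train = {a_train:.1f}%, test = {a_test:.1f}%")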

To make a long story short, the Random Forest classifier always scores 100% accuracy on the test set, no matter what parameters I set.

If, for instance, I test the following parameters:

n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0

I merge two completely overlapping subpopulations into X (each still carrying its label in y). My expectation for this experiment is that the various classifiers can only guess, so the test accuracy should be around 50%.
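As a quick sanity check on that 50% figure: when the two class-conditional distributions are identical, predicting the majority class is about the best any model can do. A chance-level baseline (a sketch using scikit-learn's DummyClassifier) confirms this:

from sklearn.dummy import DummyClassifier

# X carries no information about y, so majority-class prediction is optimal
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # ~0.5 for balanced, fully overlapping classes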

In reality, this is what I get:

| Algorithm                   | Train Accuracy % | Test Accuracy % |
|-----------------------------|------------------|-----------------|
| Support Vector Machines     | 56.3             | 42.4            |
| Logistic Regression         | 49.1             | 52.8            |
| Stochastic Gradient Descent | 50.1             | 50.4            |
| Gaussian Naive Bayes        | 50.1             | 52.8            |
| Decision Tree               | 100.0            | 51.2            |
| Random Forest               | 100.0            | *100.0*         |
| Multi-Layer Perceptron      | 50.1             | 49.6            |

I don't understand how this is possible. The Random Forest classifier never sees the test set during training, yet it still classifies it with 100% accuracy.
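(One check that would rule out the obvious explanation: verify that no test value also appears verbatim in the training split. A quick sketch, assuming the arrays defined above:)

import numpy as np

# Count test values that occur exactly in the training set
overlap = np.isin(X_test.ravel(), X_train.ravel()).sum()
print(f"{overlap} of {X_test.shape[0]} test values also appear in the training set")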

Thanks for any input!

Upon request, I am pasting my code here (with only two of the originally tested classifiers and less verbose output).

import numpy as np
import sklearn
import matplotlib.pyplot as plt

# Seed
np.random.seed(42)

# Subpopulation 1
n1 = 250
mu1 = 7.0
sigma1 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)

# Subpopulation 2
n2 = 250
mu2 = 7.0
sigma2 = 3.0
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)

# Display the data
plt.plot(s1, np.zeros(n1), 'r.')
plt.plot(s2, np.ones(n2), 'b.')

# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(f"Train set contains {X_train.shape[0]} elements; test set contains {X_test.shape[0]} elements.")

# Display the test data
X_test_0 = X_test[y_test == 0]
X_test_1 = X_test[y_test == 1]
plt.plot(X_test_0, np.zeros(X_test_0.shape[0]), 'r.')
plt.plot(X_test_1, np.ones(X_test_1.shape[0]), 'b.')

# Define a convenience function
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on the training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on the test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return both accuracies
    return a_train, a_test

# Classify

# Use a Decision Tree
from sklearn import tree
dtc = tree.DecisionTreeClassifier()

# Fit
dtc.fit(X_train, y_train)

# Calculate accuracies on training and test sets
a_train_dtc, a_test_dtc = apply_model_and_calc_accuracies(dtc)

# Report
print(f"Training accuracy = {a_train_dtc}%; test accuracy = {a_test_dtc}%")

# Use a Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# Fit
rfc.fit(X, y)

# Calculate accuracies on training and test sets
a_train_rfc, a_test_rfc = apply_model_and_calc_accuracies(rfc)

# Report
print(f"Training accuracy = {a_train_rfc}%; test accuracy = {a_test_rfc}%")
  • I have a couple of suggestions that might help debug your problem: 1) train a random forest with a low number of estimators, which essentially makes it a decision tree, and see what happens then; 2) you generated overlapping data, but also try creating identical data that carry both classes. Commented Apr 6, 2020 at 10:19
  • Following your first suggestion, I went from 100 estimators (the default) down to 10, and indeed the test accuracy went down to 96%. With 1 estimator it goes even lower, to 86.1%. So the training (and testing) procedure seems to be correct. I am not completely sure I understood your second point, however. Commented Apr 6, 2020 at 10:41
  • You use the same parameters to generate your data, but you don't necessarily generate the exact same data. What I mean is: create one dataset, label it with 0, then make a copy of it but label it with 1. That way, your model must guess. Commented Apr 6, 2020 at 11:08
  • Indeed, with two copies of the same sample, once labeled 0 and once labeled 1, the Random Forest classifier reaches a test accuracy of 43.2% (reproduced in the sketch after these comments). So everything seems to behave correctly. Now I just need to wrap my head around the idea that the Random Forest classifier can correctly label test examples from two distinct sets coming from the exact same distribution. Commented Apr 6, 2020 at 12:51
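For reference, the identical-data check suggested in the comments can be reproduced with a short sketch (the variable names and seed here are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
s = rng.normal(7.0, 3.0, 250)

# The same sample appears twice: once labeled 0, once labeled 1
X_dup = np.concatenate((s, s)).reshape(-1, 1)
y_dup = np.concatenate((np.zeros(250), np.ones(250)))

X_tr, X_te, y_tr, y_te = train_test_split(X_dup, y_dup, stratify=y_dup, test_size=0.25)
rfc = RandomForestClassifier().fit(X_tr, y_tr)
print(rfc.score(X_te, y_te))  # hovers around 0.5: the model can only guess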

2 Answers


rfc.fit(X, y) should be rfc.fit(X_train, y_train).

You are simply letting the RandomForestClassifier memorize the entire dataset, test set included.
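To see the effect directly, compare the two fits side by side (a minimal sketch reusing the data and split from the question):

from sklearn.ensemble import RandomForestClassifier

rfc_correct = RandomForestClassifier().fit(X_train, y_train)
rfc_leaky = RandomForestClassifier().fit(X, y)  # X contains every test point

print(rfc_correct.score(X_test, y_test))  # ~0.5: chance level, as expected
print(rfc_leaky.score(X_test, y_test))    # ~1.0: the forest has memorized the test set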

  • Sorry, everyone! Commented Apr 8, 2020 at 10:12

I debugged your code and I don't get those results. If I copy-paste your code and run it, I get:

from sklearn.metrics import accuracy_score

accuracy_score(rfc.predict(X_test), y_test)
>>> 0.488

y_test_hat = rfc.predict(X_test)
100 * sum(y_test == y_test_hat) / y_test.shape[0]
>>> 48.8

apply_model_and_calc_accuracies(rfc)
>>> (100.0, 48.8)

Could you share the exact call you make to get those results? It is almost certainly a debugging error, not a conceptual one.

  • After fitting the model, I call my apply_model_and_calc_accuracies(rfc) with the fitted RandomForestClassifier (rfc). Commented Apr 6, 2020 at 13:33
  • @AaronPonti Could you provide the full script? It looks fine to me right now. Commented Apr 6, 2020 at 16:07
  • I edited my original post to add a trimmed-down version of the code that shows the problem. Commented Apr 7, 2020 at 17:42
  • Running it, I get DT [training accuracy = 100.0%; test accuracy = 44.0%] and RF [training accuracy = 93.3%; test accuracy = 94.4%], which makes complete sense to me. Commented Apr 7, 2020 at 18:11
