I am using scikit-learn to classify my data; at the moment I am running a simple DecisionTree classifier. I have three classes with a big class-imbalance problem. The classes are 0, 1 and 2, and the minority classes are 1 and 2.
To give you an idea of the number of samples per class:

- 0 = 25,000 samples
- 1 = roughly 15-20 samples
- 2 = roughly 15-20 samples

So the minority classes each make up about 0.06% of the dataset. The approach I am following to deal with the imbalance is upsampling of the minority classes. Code:
```python
from sklearn.utils import resample

resample(data, replace=True, n_samples=len_major_class, random_state=1234)
```
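In full, this is roughly how I apply it per minority class (a sketch; `df` and the `'label'` column are placeholder names for my actual data, assumed to live in a pandas DataFrame):

```python
import pandas as pd
from sklearn.utils import resample

# Placeholders: df is the full DataFrame, 'label' its class column
df_major = df[df['label'] == 0]
len_major_class = len(df_major)

parts = [df_major]
for cls in (1, 2):
    df_minor = df[df['label'] == cls]
    # sample the minority class with replacement up to the majority size
    parts.append(resample(df_minor, replace=True,
                          n_samples=len_major_class, random_state=1234))

df_upsampled = pd.concat(parts)
```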
Now comes the problem. I did two tests:

- If I upsample the minority classes and then split the dataset into two groups, one for training and one for testing, the results are:
```
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       1.00      1.00      1.00     20533
          2       1.00      1.00      1.00     20439

avg / total       1.00      1.00      1.00     61542
```
A very good result.
- If I upsample ONLY the training data and leave the original data for testing, the result is:
```
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       0.00      0.00      0.00        15
          2       0.00      0.00      0.00        16

avg / total       1.00      1.00      1.00     20601
```
As you can see, the global accuracy is high, but the precision and recall for classes 1 and 2 are zero.
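To make the two setups explicit, here is a sketch of the difference between the tests (the `X`/`y` names and the `test_size` value are placeholders for my actual pipeline):

```python
from sklearn.model_selection import train_test_split

# Test 1: upsample FIRST, then split. Duplicated minority rows can end up
# in both the training set and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_upsampled, y_upsampled, test_size=0.25, random_state=1234)

# Test 2: split FIRST on the original imbalanced data, then upsample only
# the training portion. The test set keeps the true class proportions
# (hence the supports of 20570 / 15 / 16 above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234)
# ... upsample (X_train, y_train) here, leave (X_test, y_test) untouched
```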
I am creating the classifier in this way:
```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, max_features=0.4,
                             random_state=1234, criterion='entropy')
```

I have also tried adding class_weight='balanced', but it makes no difference.
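For reference, that attempt looked like this (same setup, just with the extra parameter; `class_weight='balanced'` weights classes inversely proportional to their frequencies in the training data):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, max_features=0.4,
                             random_state=1234, criterion='entropy',
                             class_weight='balanced')
```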
Since upsampling only the training data should be the correct approach, why am I getting this strange result?