I am using scikit-learn to classify my data; at the moment I am running a simple DecisionTree classifier. I have three classes with a big class-imbalance problem. The classes are 0, 1 and 2, and the minority classes are 1 and 2.
To give you an idea of the number of samples per class:

- 0 = 25,000 samples
- 1 = roughly 15-20 samples
- 2 = roughly 15-20 samples

So the minority classes each make up about 0.06% of the dataset. The approach I am following to deal with the imbalance is upsampling of the minority classes. Code:
```python
from sklearn.utils import resample

resample(data, replace=True, n_samples=len_major_class, random_state=1234)
```
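In full, this is roughly how I apply it per minority class (a sketch; `df` and the `'label'` column are placeholder names for my actual data, assumed to live in a pandas DataFrame):

```python
import pandas as pd
from sklearn.utils import resample

# Placeholders: df is the full DataFrame, 'label' its class column
df_major = df[df['label'] == 0]
len_major_class = len(df_major)

parts = [df_major]
for cls in (1, 2):
    df_minor = df[df['label'] == cls]
    # sample the minority class with replacement up to the majority size
    parts.append(resample(df_minor, replace=True,
                          n_samples=len_major_class, random_state=1234))

df_upsampled = pd.concat(parts)
```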
Now comes the problem. I did two tests:

- If I upsample the minority classes and then split the dataset into two groups, one for training and one for testing, the results are:
```
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       1.00      1.00      1.00     20533
          2       1.00      1.00      1.00     20439

avg / total       1.00      1.00      1.00     61542
```
A very good result.
- If I upsample ONLY the training data and leave the original data for testing, the result is:
```
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     20570
          1       0.00      0.00      0.00        15
          2       0.00      0.00      0.00        16

avg / total       1.00      1.00      1.00     20601
```
As you can see, the global accuracy is high, but the precision and recall for classes 1 and 2 are zero.
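To make the two setups explicit, here is a sketch of the difference between the tests (the `X`/`y` names and the `test_size` value are placeholders for my actual pipeline):

```python
from sklearn.model_selection import train_test_split

# Test 1: upsample FIRST, then split. Duplicated minority rows can end up
# in both the training set and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_upsampled, y_upsampled, test_size=0.25, random_state=1234)

# Test 2: split FIRST on the original imbalanced data, then upsample only
# the training portion. The test set keeps the true class proportions
# (hence the supports of 20570 / 15 / 16 above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234)
# ... upsample (X_train, y_train) here, leave (X_test, y_test) untouched
```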
I am creating the classifier in this way:
```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, max_features=0.4,
                             random_state=1234, criterion='entropy')
```

I have also tried adding class_weight='balanced', but it makes no difference.
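For reference, that attempt looked like this (same setup, just with the extra parameter; `class_weight='balanced'` weights classes inversely proportional to their frequencies in the training data):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, max_features=0.4,
                             random_state=1234, criterion='entropy',
                             class_weight='balanced')
```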
Since upsampling only the training data should be the correct approach, why am I getting this strange result?