
I am using LinearSVC to classify my documents into categories. However, my dataset is unbalanced: some categories have 48,000 documents under them and some as few as 100. When I train my model, even using StratifiedKFold, I see that the category with 48,000 documents gets a far larger share of each fold (about 3,300 documents) compared to the others. In such a case, it would definitely give me biased predictions. How can I make sure this selection isn't biased?

    kf = StratifiedKFold(labels, n_folds=10, shuffle=True)
    for train_index, test_index in kf:
        X_train, X_test = docs[train_index], docs[test_index]
        Y_train, Y_test = labels[train_index], labels[test_index]

Then I'm writing these(X_train, Y_train) to a file, computing the feature matrix and passing them to the classifier as follows:

    model1 = LinearSVC()
    model1 = model1.fit(matrix, label_tmp)
    pred = model1.predict(matrix_test)
    print("Accuracy is:")
    print(metrics.accuracy_score(label_test, pred))
    print(metrics.classification_report(label_test, pred))

1 Answer


The StratifiedKFold method by default takes into account the ratio of labels across all your classes, meaning that each fold will have the same (or close to the same) ratio of each label as the full dataset. Whether you want to adjust for this is somewhat up to you - you can either let the classifier learn some kind of bias toward labels with more samples (as you are now), or you can do one of two things:
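To see this ratio preservation concretely, here is a toy sketch (the `labels` and class names are made up, and it uses the newer `StratifiedKFold(...).split(...)` API rather than the older constructor signature shown in the question):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 documents of class "big", 10 of class "small"
labels = np.array(["big"] * 90 + ["small"] * 10)
docs = np.arange(len(labels))  # stand-in for document indices

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in skf.split(docs, labels):
    # Each test fold keeps the dataset's 9:1 ratio: 9 "big", 1 "small"
    test_labels = labels[test_index]
    print((test_labels == "big").sum(), (test_labels == "small").sum())
```

Every fold mirrors the 9:1 imbalance of the whole dataset, which is exactly the behavior the question observes.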

  1. Construct a separate train / test set, where the training set has an equal number of samples for each label (therefore in your case, each class label in the training set might only have 50 examples, which is not ideal). Then you can train on your training set and test on the rest. If you do this multiple times with different samples, you are essentially doing k-fold cross validation, just choosing your sample sizes in a different way.
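A minimal sketch of option 1, downsampling every class to the size of the smallest one (the `labels` array and class names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels standing in for your category labels
labels = np.array(["a"] * 1000 + ["b"] * 100 + ["c"] * 50)
docs = np.arange(len(labels))  # stand-in for document indices

# Downsample every class to the size of the rarest class
min_count = min(np.sum(labels == c) for c in np.unique(labels))
train_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=min_count, replace=False)
    for c in np.unique(labels)
])

# Everything not sampled into the training set becomes the test set
test_idx = np.setdiff1d(docs, train_idx)

X_train, y_train = docs[train_idx], labels[train_idx]
```

Repeating this with different random seeds and averaging the results gives you the cross-validation-like scheme described above.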

  2. You can change your loss function (i.e. the way you initialize LinearSVC()) to account for the class imbalances. For example: model = LinearSVC(class_weight='balanced'). This will cause the model to learn a loss function that takes class imbalances into account.
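A small end-to-end sketch of option 2 (the corpus and label names are toy data, not from the question; the `TfidfVectorizer` step stands in for however you compute your feature matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus: one class vastly outnumbers the other
texts = ["please refund my order"] * 95 + ["when does my package ship"] * 5
labels = ["refund"] * 95 + ["ship"] * 5

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# class_weight='balanced' weights each class by n_samples / (n_classes * count)
# so errors on the rare "ship" class cost more during training
model = LinearSVC(class_weight="balanced")
model.fit(X, labels)

print(model.predict(vec.transform(["when does my package ship"])))
```

With the balanced weighting, the rare class still gets learned rather than being drowned out by the majority class.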


2 Comments

Hi, I think 'balanced' isn't supported by my scikit version, so I set it to 'auto' instead. I see that the accuracy has dropped from 70% (earlier) to 53% now. Is it a disadvantage then?
(1) You should update your scikit-learn version. No reason to use an older one. If you read the docs, 'auto' turned into 'balanced' in the new version. (2) The accuracy drop is expected. Think of this scenario: you have 9 examples of class A and one example of class B. If your classifier always guesses that a sample belongs to class A, your accuracy is 90%. If you learn a balanced classifier, you might actually guess things as belonging to both A and B, but you might get more wrong. As a result, accuracy is not the right metric to use - look into precision, or average precision instead.
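The 9-to-1 scenario from this comment can be checked directly with scikit-learn's metrics (a toy sketch; `balanced_accuracy_score` requires scikit-learn >= 0.20):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 9 examples of class A, 1 of class B, and a classifier
# that always guesses A
y_true = np.array(["A"] * 9 + ["B"])
y_pred = np.array(["A"] * 10)

print(accuracy_score(y_true, y_pred))           # 0.9 despite never finding B
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 - averages per-class recall
```

Plain accuracy rewards the majority-class guesser with 90%, while a per-class metric exposes that class B is never found - which is why the accuracy drop after balancing is not necessarily a disadvantage.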
