I am using a linear SVM (LinearSVC) to classify my documents into categories. However, my dataset is unbalanced: some categories have 48,000 documents and others as few as 100. When I train my model, even with stratified k-fold, I see that the category with 48,000 documents gets a larger share of the documents (3,300) than the others. I expect this to bias my predictions toward the large category. How can I make sure this selection isn't biased?

    kf = StratifiedKFold(labels, n_folds=10, shuffle=True)
    for train_index, test_index in kf:
        X_train, X_test = docs[train_index], docs[test_index]
        Y_train, Y_test = labels[train_index], labels[test_index]

Then I'm writing these (X_train, Y_train) to a file, computing the feature matrix and passing them to the classifier as follows:

    model1 = LinearSVC()
    model1 = model1.fit(matrix, label_tmp)
    pred = model1.predict(matrix_test)
    print("Accuracy is:")
    print(metrics.accuracy_score(label_test, pred))
    print(metrics.classification_report(label_test, pred))
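
To make the observation about fold sizes concrete, here is a minimal, self-contained sketch of how stratified folds split an unbalanced label set. It is only an illustration under assumptions: the labels are synthetic (4,800 documents in one class and 100 in another, standing in for my real categories), `docs` is a placeholder array rather than real documents, and it uses the newer `sklearn.model_selection.StratifiedKFold` interface (`n_splits`, `.split(X, y)`) instead of the older call shown above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real labels (an assumption for illustration):
# 4800 documents in class 0 and 100 documents in class 1.
labels = np.array([0] * 4800 + [1] * 100)
docs = np.arange(len(labels)).reshape(-1, 1)  # placeholder "documents"

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_counts = []
for train_index, test_index in skf.split(docs, labels):
    # Count how many documents of each class land in this test fold.
    fold_counts.append(np.bincount(labels[test_index], minlength=2))

fold_counts = np.array(fold_counts)  # shape: (n_folds, n_classes)
print(fold_counts)
```

Each row of `fold_counts` is one test fold, so the large class contributes many more documents per fold than the small one. As far as I understand it, that is what stratification is meant to do: each fold keeps roughly the same class proportions as the full dataset rather than equal counts per class.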