
I'm trying to understand K-fold cross validation, as I'm using it for the first time for my text classification. However, I'm quite confused about how to implement it in Python.

I have a data frame where 'data' is the text to classify and 'label' holds the target values (0 or 1). I currently use a train/test split approach and apply Multinomial NB to the vectorized data.

    from sklearn import model_selection
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer

    # split the data into training and testing datasets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        df['data'], df['label'], random_state=1)

    vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)

I just wanted to know how I can implement 5-fold cross validation in a similar way. I looked into a lot of examples but was quite confused about how to do it the right way, as I'm a beginner.

  • Find the accuracy of y_pred_class by selecting random samples from X_test_dtm instead of all samples at once, then average those predictions over 5 runs. That average accuracy is what your trained model achieves under k-fold cross validation with k = 5.

2 Answers


Just use scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

    >>> import numpy as np
    >>> from sklearn.model_selection import KFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> kf = KFold(n_splits=2)
    >>> kf.get_n_splits(X)
    2
    >>> print(kf)
    KFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in kf.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [2 3] TEST: [0 1]
    TRAIN: [0 1] TEST: [2 3]

Note that the documentation example above uses n_splits=2. In your own code you can omit the n_splits parameter entirely, because the default value is 5, which is exactly what you requested!
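A quick sanity check (a minimal sketch, assuming only that scikit-learn is installed) confirms the default:

    from sklearn.model_selection import KFold

    kf = KFold()              # n_splits omitted, so the default applies
    print(kf.get_n_splits())  # prints 5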

I guess this is the easiest way. Always look at the documentation: it provides code examples as well as explanations of all the parameters!

Did this help?

EDIT:

The full code would look like this!

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import KFold

    X, y = df['data'], df['label']

    # n_splits is omitted, so the default of 5 folds is used
    kf = KFold()
    for train_index, test_index in kf.split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
        # use .iloc for positional indexing on pandas Series
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # fit the vectorizer on the training fold only, then apply it to the test fold
        vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
        X_train_dtm = vect.fit_transform(X_train)
        X_test_dtm = vect.transform(X_test)
        nb = MultinomialNB()
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)

Note that the separate train_test_split is no longer needed here: the k-fold loop itself performs the splitting.
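As a possible alternative (a sketch, not part of the original answer; it assumes the same df as above), scikit-learn can run the whole loop for you if you wrap the vectorizer and classifier in a Pipeline and pass it to cross_val_score:

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer

    # the pipeline re-fits the vectorizer inside each training fold automatically,
    # which avoids leaking test-fold vocabulary into training
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english"),
        MultinomialNB(),
    )
    scores = cross_val_score(pipeline, df['data'], df['label'], cv=5, scoring='accuracy')
    print(scores.mean(), scores.std())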



Here is a code sample of how you can use StratifiedKFold (a stratified variant of KFold):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score

    X, y = df['data'], df['label']
    metrics = []
    skf = StratifiedKFold(n_splits=5)
    for train_index, test_index in skf.split(X, y):
        # use .iloc for positional indexing on pandas Series
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
        X_train_dtm = vect.fit_transform(X_train)
        X_test_dtm = vect.transform(X_test)
        nb = MultinomialNB()
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)
        metrics.append(accuracy_score(y_test, y_pred_class))

    metrics = np.array(metrics)
    print('Mean accuracy:', np.mean(metrics))
    print('Std for accuracy:', np.std(metrics))
  • The main idea is that you measure the model's performance over 5 experiments.
  • You can evaluate not only the average accuracy but also its standard deviation: the smaller the std, the better the model.
  • It is better to use StratifiedKFold instead of KFold (see the sketch below).
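To illustrate the last point, here is a minimal sketch with made-up toy labels (not your data) showing why stratification matters when classes are imbalanced:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    # hypothetical imbalanced labels: 8 zeros and 2 ones
    X_demo = np.zeros((10, 1))
    y_demo = np.array([0] * 8 + [1] * 2)

    for name, cv in [("KFold", KFold(n_splits=2)),
                     ("StratifiedKFold", StratifiedKFold(n_splits=2))]:
        for train_idx, test_idx in cv.split(X_demo, y_demo):
            print(name, "test labels:", y_demo[test_idx])
    # KFold:           [0 0 0 0 0] then [0 0 0 1 1] -> one fold has no positives at all
    # StratifiedKFold: [0 0 0 0 1] then [0 0 0 0 1] -> each fold keeps the 8:2 class ratio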


  • Thank you for the reply. In the for loop it must be train_index and test_index rather than test and train, right? Otherwise I'm able to run the code successfully.
