
I'm trying to understand K-fold cross validation, as I'm using it for the first time for my text classification. However, I'm quite confused about how to implement it in Python.

I have a data frame where 'data' is the text to classify and 'label' holds the target values (0 or 1). I currently use a train/test split approach and apply Multinomial NB to the vectorized data.

    from sklearn import model_selection
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer

    # split the data into training and testing datasets
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        df['data'], df['label'], random_state=1)

    vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)

I just wanted to know how I can implement 5-fold cross validation in a similar way. I looked into a lot of examples but was quite confused about how to do it the right way, as I'm a beginner.

  • Find the accuracy of y_pred_class by selecting random samples from X_test_dtm instead of all samples at once, then average those predictions over 5 runs. That average accuracy is what your trained model achieves under k-fold cross validation with k = 5.

2 Answers


Just use scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

    >>> import numpy as np
    >>> from sklearn.model_selection import KFold
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> kf = KFold(n_splits=2)
    >>> kf.get_n_splits(X)
    2
    >>> print(kf)
    KFold(n_splits=2, random_state=None, shuffle=False)
    >>> for train_index, test_index in kf.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [2 3] TEST: [0 1]
    TRAIN: [0 1] TEST: [2 3]

Note that the documentation example above uses n_splits=2. In your own code you can omit the n_splits parameter entirely, because the default value is 5, which is exactly what you requested!
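A quick sanity check (a minimal sketch, assuming only that scikit-learn is installed) confirms the default:

    from sklearn.model_selection import KFold

    kf = KFold()              # n_splits omitted, so the default applies
    print(kf.get_n_splits())  # prints 5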

I guess this is the easiest way. Always look at the documentation: it provides code examples as well as explanations of all the parameters!

Did this help?

EDIT:

The full code would look like this!

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import KFold

    X, y = df['data'], df['label']

    # n_splits is omitted, so the default of 5 folds is used
    kf = KFold()
    for train_index, test_index in kf.split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
        # use .iloc for positional indexing on pandas Series
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # fit the vectorizer on the training fold only, then apply it to the test fold
        vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
        X_train_dtm = vect.fit_transform(X_train)
        X_test_dtm = vect.transform(X_test)
        nb = MultinomialNB()
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)

Note that the separate train_test_split is no longer needed here: the k-fold loop itself performs the splitting.
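As a possible alternative (a sketch, not part of the original answer; it assumes the same df as above), scikit-learn can run the whole loop for you if you wrap the vectorizer and classifier in a Pipeline and pass it to cross_val_score:

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer

    # the pipeline re-fits the vectorizer inside each training fold automatically,
    # which avoids leaking test-fold vocabulary into training
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english"),
        MultinomialNB(),
    )
    scores = cross_val_score(pipeline, df['data'], df['label'], cv=5, scoring='accuracy')
    print(scores.mean(), scores.std())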



Here is a code sample of how you can use StratifiedKFold (a stratified variant of KFold):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score

    X, y = df['data'], df['label']
    metrics = []
    skf = StratifiedKFold(n_splits=5)
    for train_index, test_index in skf.split(X, y):
        # use .iloc for positional indexing on pandas Series
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        vect = CountVectorizer(ngram_range=(1, 2), max_features=1000, stop_words="english")
        X_train_dtm = vect.fit_transform(X_train)
        X_test_dtm = vect.transform(X_test)
        nb = MultinomialNB()
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)
        metrics.append(accuracy_score(y_test, y_pred_class))

    metrics = np.array(metrics)
    print('Mean accuracy:', np.mean(metrics))
    print('Std for accuracy:', np.std(metrics))
  • The main idea is that you measure the model's performance over 5 experiments.
  • You can evaluate not only the average accuracy but also its standard deviation: the smaller the std, the better the model.
  • It is better to use StratifiedKFold instead of KFold (see the sketch below).
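To illustrate the last point, here is a minimal sketch with made-up toy labels (not your data) showing why stratification matters when classes are imbalanced:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    # hypothetical imbalanced labels: 8 zeros and 2 ones
    X_demo = np.zeros((10, 1))
    y_demo = np.array([0] * 8 + [1] * 2)

    for name, cv in [("KFold", KFold(n_splits=2)),
                     ("StratifiedKFold", StratifiedKFold(n_splits=2))]:
        for train_idx, test_idx in cv.split(X_demo, y_demo):
            print(name, "test labels:", y_demo[test_idx])
    # KFold:           [0 0 0 0 0] then [0 0 0 1 1] -> one fold has no positives at all
    # StratifiedKFold: [0 0 0 0 1] then [0 0 0 0 1] -> each fold keeps the 8:2 class ratio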


  • Thank you for the reply. In the for loop it must be train_index and test_index rather than test and train, right? Otherwise I'm able to run the code successfully.
