Python scikit learn KFold function uneven train, test split

Question

I have the code below, and I have noticed that the sizes of the train and test splits from `KFold.split()` are different for the last fold. Why does this happen, and how can I work around it? Thanks.

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.arange(0, 47, 1)
    kfold = KFold(6)  # init for 6 fold cross validation
    for train, test in kfold.split(data):  # split data into train and test
        print("train size:", len(train), "test size:", len(test))

Oxbowerce · Accepted Answer · 2021-11-02 11:13:25Z

What behaviour would you expect? The difference comes simply from the fact that the number of samples cannot be distributed evenly over the number of folds you provided. You have 47 samples in your dataset and want to split them into 6 folds for cross validation. $47 / 6 = 7 \frac{5}{6}$, which would mean that the test set in each fold contains $7 \frac{5}{6}$ samples. That is impossible, since only complete samples can be included.

As a result, 5 out of 6 times the test set will contain 8 samples (leaving 39 for training) and 1 out of 6 times it will contain 7 samples (leaving 40 for training), which gives an average of $7 \frac{5}{6}$ samples per test set: $\frac{5}{6} \cdot 8 + \frac{1}{6} \cdot 7 = 7 \frac{5}{6}$. scikit-learn assigns the extra samples to the first folds, which is why the last fold is the one that differs.

If you change the number of samples in your dataset to a number divisible by 6 (e.g. 48), the test set will have the same size in every fold, since dividing 48 by 6 gives a whole number rather than a fraction.

    from sklearn.model_selection import KFold
    import numpy as np

    data = np.arange(0, 48, 1)
    kfold = KFold(6)
    for train, test in kfold.split(data):
        print("train size:", len(train), "test size:", len(test))

    # train size: 40 test size: 8
    # train size: 40 test size: 8
    # train size: 40 test size: 8
    # train size: 40 test size: 8
    # train size: 40 test size: 8
    # train size: 40 test size: 8
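To make the arithmetic above concrete, here is a minimal pure-Python sketch of how the fold sizes work out. It assumes the rule described in the scikit-learn documentation for `KFold`: the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` test samples and the remaining folds get `n_samples // n_splits`. The helper name `kfold_test_sizes` is made up for this illustration and is not part of scikit-learn.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Return the test-set size of each fold (hypothetical helper, not a scikit-learn API)."""
    base, remainder = divmod(n_samples, n_splits)
    # The first `remainder` folds receive one extra sample each.
    return [base + 1 if i < remainder else base for i in range(n_splits)]


for n_samples in (47, 48):
    test_sizes = kfold_test_sizes(n_samples, 6)
    # Whatever is not in the test set of a fold is in its training set.
    train_sizes = [n_samples - t for t in test_sizes]
    print(n_samples, "samples -> test:", test_sizes, "train:", train_sizes)
```

With 47 samples this gives test sizes of 8 for the first five folds and 7 for the last one. If equal fold sizes really matter for your use case, one option is to pick a sample count that is a multiple of `n_splits`, but in most cases a difference of one sample between folds is harmless.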
