
I understand that using 100% of the dataset and doing k-fold cross validation instead of train_test_split would eliminate the randomness the latter method has in splitting, and thus potentially avoid overfitting.

But I have seen that this is not considered best practice for k-fold CV. Instead, the recommendation is to split the dataset first (say 80% training, 20% testing) and then perform k-fold CV on the 80% training portion (image below).

What I am confused about is that with this method, we are again potentially falling into a random split by using train_test_split before CV. Why is this method then generally considered best practice? What am I missing?

[Figure: a train/test split followed by k-fold cross validation on the training portion only]


3 Answers


You can mend this by doing nested CV.

However... since the test data is only used to assess final model performance, not for deciding anything, one usually goes for the simple split.
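A minimal sketch of what nested CV can look like with scikit-learn, assuming an illustrative dataset, model, and parameter grid (none of these come from the answer itself): the inner loop tunes hyperparameters, the outer loop estimates performance on data the tuning never saw.

```python
# Nested CV sketch: inner loop tunes, outer loop evaluates.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

param_grid = {"C": [0.1, 1, 10]}  # illustrative grid
tuned_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Each outer fold's held-out data is never seen by the inner tuning loop,
# so the outer scores estimate performance without a separate test split.
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```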

  • It is not the different variations of CV that I am confused with. I am just wondering why one would NOT use 100% of the dataset to do CV and instead go for a train_test_split then CV, potentially risking an unbalanced distribution. Commented Nov 4, 2022 at 23:29
  • @MxML Because essentially the model itself is a hyperparameter to tune. So when we are comparing models, we need a "second" validation procedure. In general, even with a single model, doing a nested CV is not wrong. As Michael says (+1), for a single model, using a single test fold/split is mostly fine. Ideally we can actually compare the CV-procedure error to the test-fold error, and they should be relatively close to each other. Commented Nov 4, 2022 at 23:52

K-fold cross validation is used to determine the general fit of a model for a modelling task. It is especially useful when the amount of data is limited.

Once you determine which model is best for your problem via k-fold CV, you will train the chosen model on the entire training set, and then test the model on a dataset that was never used during the modelling process. The test set should only be used once to prevent bias. This is why you need to split the data into a training and test set before k-fold CV. Otherwise you run into the problem of having a biased model evaluation.
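As a rough sketch of that workflow, assuming illustrative candidate models and an illustrative dataset (nothing here is prescribed by the answer): hold out the test set first, compare candidates with k-fold CV on the training portion only, refit the winner on the full training set, and touch the test set exactly once.

```python
# Split first, select via k-fold CV on training data, test once at the end.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {  # illustrative candidate models
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=0),
}

# Compare candidates using only the training data.
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on all training data; use the test set exactly once.
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, best_model.score(X_test, y_test))
```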


Generally we would prefer to have more data, not less, so we need to make efficient use of our data. That can leave you with smallish test datasets. If you have one 20% test dataset, you do not know how precisely it describes the predictive performance. If you have 5 different 20% test datasets, you can get a rough idea of whether they lead to 5 similar or 5 diverging results. There is always leave-one-out at the other extreme, but that becomes computationally challenging very fast.
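A small sketch of that point, with an illustrative dataset and model (both assumptions, not from the answer): five folds give five estimates, and their spread is a rough indication of how far off any single 20% split could be.

```python
# Five folds, five accuracy estimates; the spread hints at split-to-split variability.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
print("per-fold accuracy:", scores.round(3))
print(f"mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}")
```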

