
In his Coursera video lecture, Prof. Andrew Ng discusses some basic good practices in machine learning. At around the 11-minute mark of this lecture, https://www.youtube.com/watch?v=ISBGFY-gBug, he shows the learning curve, a plot of cross-validation error and training error against the size of the training set. I am using the k-fold cross-validation method for hyperparameter tuning and model selection.

In this scenario,

  • Consider the variable Xdata to be the entire dataset, which is split into a training set, DataTrain, used in the k-fold setup, and an independent test set.
  • Within the k-fold setup, DataTrain is further split, fold by fold, into a training subset, trainData, and a validation subset, testData.
  • The independent test set is denoted by the variable DataTest.

When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?


1 Answer


When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?

No.

  • The training error would be the average, over the K folds, of the error on trainData.

  • The cross-validation error would be the average, over the K folds, of the error on testData.

Remember that for each fold, the datasets trainData and testData are different.
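To make the two bullets concrete, here is a minimal sketch of the per-fold computation, assuming a scikit-learn-style setup; the toy data, the LogisticRegression model, and the grid of training-subset fractions are illustrative stand-ins, not anything specified in the question.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    # Toy stand-in for DataTrain, the set fed into the k-fold procedure.
    X, y = make_classification(n_samples=500, random_state=0)

    fracs = np.linspace(0.2, 1.0, 5)  # growing fractions of each fold's trainData
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    K = kf.get_n_splits()

    train_err = np.zeros((len(fracs), K))
    test_err = np.zeros((len(fracs), K))

    for k, (tr_idx, te_idx) in enumerate(kf.split(X, y)):
        # Shuffle so each growing prefix is a random subset of this fold's trainData.
        tr_idx = np.random.default_rng(k).permutation(tr_idx)
        for i, frac in enumerate(fracs):
            sub = tr_idx[: int(frac * len(tr_idx))]
            model = LogisticRegression(max_iter=1000).fit(X[sub], y[sub])
            train_err[i, k] = 1 - model.score(X[sub], y[sub])       # error on trainData subset
            test_err[i, k] = 1 - model.score(X[te_idx], y[te_idx])  # error on testData

    # One learning-curve point per subset size: the average over the K folds.
    train_curve = train_err.mean(axis=1)
    cv_curve = test_err.mean(axis=1)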


Source:

A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
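The behaviour in that quote matches scikit-learn's learning_curve helper, which performs the splitting, refitting, and per-size scoring in one call. A minimal usage sketch, with the same illustrative toy data and model as above (not anything from the original post):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=500, random_state=0)

    # learning_curve returns one score per (training size, fold) pair.
    train_sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        cv=5, train_sizes=np.linspace(0.2, 1.0, 5),
    )

    # Scores are accuracies with shape (n_sizes, n_folds); average over the k runs
    # and convert to misclassification error for the learning-curve plot.
    train_error = 1 - train_scores.mean(axis=1)
    cv_error = 1 - test_scores.mean(axis=1)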

  • Thank you for your answer and the links. Can you please say how to calculate the variance that is often reported for k-fold cross-validation? Is it a scalar value: the variance of the misclassification errors on the testData fold subsets? Commented Jul 21, 2018 at 17:33
  • You've got to be careful with what you mean by variance; there are raging debates on this site about the theory behind variance for k-fold cross-validation. If you want to reproduce the standard-deviation fill-between plots seen on the sklearn website in the link, then you compute the standard deviation of the K training errors (i.e., of each fold), but this isn't really the variance of the CV estimator; it's the variance across the K folds. See stats.stackexchange.com/questions/61783/… Commented Jul 21, 2018 at 17:44
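To make the last comment concrete, here is a minimal sketch of the per-fold spread that the sklearn fill-between bands show, using a dummy score array in place of the one returned by learning_curve (the array shape is the assumption here, not a prescription):

    import numpy as np

    # Dummy stand-in for the per-fold scores returned by learning_curve,
    # shape (n_train_sizes, n_folds).
    test_scores = np.random.default_rng(0).uniform(0.7, 0.9, size=(5, 10))

    mean_per_size = test_scores.mean(axis=1)  # the learning-curve points
    std_per_size = test_scores.std(axis=1)    # spread across the K folds --
                                              # what the shaded band shows, not
                                              # the variance of the CV estimator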
