Resampling methods Cross Validation Bootstrap Bias and variance estimation with the Bootstrap Three-way data partitioning

Cross-validation Resampling methods Cross Validation Bootstrap Bias and variance estimation with the Bootstrap Three-way data partitioning

Introduction
* One may be tempted to use the entire training data to select the “optimal” classifier, then estimate the error rate on that same data
* This naïve approach has two fundamental problems
  * The final model will normally overfit the training data: it will not be able to generalize to new data
    * The problem of overfitting is more pronounced with models that have a large number of parameters
  * The error rate estimate will be overly optimistic (lower than the true error rate)
    * In fact, it is not uncommon to have 100% correct classification on training data
* The techniques presented in this lecture will allow you to make the best use of your (limited) data for
  * Training
  * Model selection
  * Performance estimation

The holdout method
* Split the dataset into two groups
  * Training set: used to train the classifier
  * Test set: used to estimate the error rate of the trained classifier
[Figure: the total number of examples divided into a training set and a test set]
* The holdout method has two basic drawbacks
  * In problems where we have a sparse dataset, we may not be able to afford the “luxury” of setting aside a portion of the dataset for testing
  * Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an “unfortunate” split
* The limitations of the holdout can be overcome with a family of resampling methods, at the expense of higher computational cost
  * Cross Validation
    * Random Subsampling
    * K-Fold Cross-Validation
    * Leave-one-out Cross-Validation
  * Bootstrap
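A minimal sketch of a single holdout split, assuming the dataset is addressed by integer indices. The function name `holdout_split` and the parameters `test_fraction` and `seed` are illustrative choices, not part of the slides.

```python
import random

def holdout_split(n_examples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for one random holdout split."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_examples))
    rng.shuffle(indices)               # random assignment of examples to the two sets
    n_test = int(round(test_fraction * n_examples))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10, test_fraction=0.3)
print("train:", sorted(train_idx))
print("test: ", sorted(test_idx))
```

Changing `seed` gives a different split and, in general, a different error estimate, which is the "unfortunate split" risk described above.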

Random Subsampling Random Subsampling performs K data splits of the entire dataset  Each data split randomly selects a (fixed) number of examples without replacement  For each data split we retrain the classifier from scratch with the training examples and then estimate Ei with the test examples The true error estimate is obtained as the average of the separate estimates Ei  This estimate is significantly better than the holdout estimate 4  Total number of examples Experiment 1 Experiment 2 Experiment 3 Test example

K-Fold Cross-validation
* Create a K-fold partition of the dataset
  * For each of K experiments, use K-1 folds for training and the remaining fold for testing
  * This procedure is illustrated in the following figure for K=4
[Figure: the total number of examples split into 4 folds; each of experiments 1 to 4 holds out a different fold as test examples]
* K-Fold Cross-validation is similar to Random Subsampling
  * The advantage of K-Fold Cross-validation is that all the examples in the dataset are eventually used for both training and testing
* As before, the true error is estimated as the average error rate on test examples
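A small sketch of how the K-fold partition can be generated from indices. The helper name `k_fold_indices` is illustrative; it assumes the examples have already been shuffled, since it cuts the index range into contiguous folds.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; every index is a test example exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if f < n_examples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_examples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=4))
for experiment, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Experiment {experiment}: test={test_idx} train={train_idx}")
```

The overall error estimate is then the average of the K per-fold test error rates.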

Leave-one-out Cross Validation
* Leave-one-out is the degenerate case of K-Fold Cross Validation, where K is chosen as the total number of examples
  * For a dataset with N examples, perform N experiments
  * For each experiment, use N-1 examples for training and the remaining example for testing
[Figure: the total number of examples; each of experiments 1 to N holds out a single, different test example]
* As usual, the true error is estimated as the average error rate on test examples
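As an illustration, leave-one-out can be written as a loop that holds out one example at a time. The 1-nearest-neighbour classifier on scalar inputs and the six-point dataset below are assumptions made for the sketch, not part of the slides.

```python
def nearest_neighbor_predict(train, x):
    """1-nearest-neighbour on scalar inputs (toy classifier, an assumption)."""
    _, label = min(train, key=lambda ex: abs(ex[0] - x))
    return label

def leave_one_out_error(data):
    """Run N experiments, each training on N-1 examples and testing on the one left out."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]        # all examples except the i-th
        if nearest_neighbor_predict(train, x) != y:
            mistakes += 1
    return mistakes / len(data)

# Hypothetical dataset: the point at x=4 lies closer to class 'a' than to its own class.
data = [(0, 'a'), (1, 'a'), (2, 'a'), (4, 'b'), (10, 'b'), (11, 'b')]
loo_error = leave_one_out_error(data)
print("leave-one-out error estimate:", loo_error)
```

Only the held-out point at x=4 is misclassified here, so the estimate is 1/6.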

How many folds are needed?
* With a large number of folds
  + The bias of the true error rate estimator will be small (the estimator will be very accurate)
  - The variance of the true error rate estimator will be large
  - The computational time will be very large as well (many experiments)
* With a small number of folds
  + The number of experiments and, therefore, computation time are reduced
  + The variance of the estimator will be small
  - The bias of the estimator will be large (conservative, i.e. higher than the true error rate, because each model is trained on a smaller share of the data)
* In practice, the choice of the number of folds depends on the size of the dataset
  * For large datasets, even 3-Fold Cross Validation will be quite accurate
  * For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible
* A common choice for K-Fold Cross Validation is K=10 or K=5

The bootstrap (1)
* The bootstrap is a resampling technique with replacement
  * From a dataset with N examples
    * Randomly select (with replacement) N examples and use this set for training
    * The remaining examples that were not selected for training are used for testing
      * The number of such test examples is likely to change from fold to fold
  * Repeat this process for a specified number of folds (K)
  * As before, the true error is estimated as the average error rate on test examples
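A minimal sketch of one bootstrap fold expressed on indices; the name `bootstrap_split` is illustrative. Because each draw misses a given example with probability 1 - 1/N, roughly a fraction (1 - 1/N)^N, about 37% for large N, of the examples is left for testing, and the exact count varies from fold to fold.

```python
import random

def bootstrap_split(n_examples, rng):
    """Draw N training indices with replacement; unselected indices form the test set."""
    train_idx = [rng.randrange(n_examples) for _ in range(n_examples)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n_examples) if i not in chosen]
    return train_idx, test_idx

rng = random.Random(0)
splits = [bootstrap_split(20, rng) for _ in range(5)]   # K = 5 folds
test_sizes = [len(test_idx) for _, test_idx in splits]
print("test-set size per fold:", test_sizes)
```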

The bootstrap (2)
* Compared to basic cross-validation, the bootstrap increases the variance that can occur in each fold [Efron and Tibshirani, 1993]
  * This is a desirable property since it is a more realistic simulation of the real-life experiment from which our dataset was obtained
* Consider a classification problem with C classes, a total of N examples and N_i examples for each class ω_i
  * The a priori probability of choosing an example from class ω_i is N_i/N
  * Once we choose an example from class ω_i, if we do not replace it for the next selection, then the a priori probabilities will have changed, since the probability of choosing an example from class ω_i will now be (N_i-1)/(N-1)
  * Thus, sampling with replacement preserves the a priori probabilities of the classes throughout the random selection process
* An additional benefit of the bootstrap is its ability to obtain accurate measures of BOTH the bias and variance of the true error estimate
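The effect of replacement on the class priors can be checked with a short calculation. The helper `prior_after_draw` and the numbers N = 10, N_i = 4 are hypothetical; the sketch assumes that one example of class ω_i has just been drawn, so that without replacement the pool shrinks to N-1 examples, of which N_i-1 belong to ω_i.

```python
from fractions import Fraction

def prior_after_draw(n_i, n_total, replace):
    """Probability that the next draw is from class i, given the last draw was from class i."""
    if replace:
        return Fraction(n_i, n_total)           # pool unchanged: prior is preserved
    return Fraction(n_i - 1, n_total - 1)       # one class-i example has left the pool

print("initial prior:       ", Fraction(4, 10))
print("with replacement:    ", prior_after_draw(4, 10, replace=True))
print("without replacement: ", prior_after_draw(4, 10, replace=False))
```

With replacement the probability stays at 2/5; without replacement it drops to 1/3.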

Three-way data splits (1)
* If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets [Ripley, 1996]
  * Training set: a set of examples used for learning, i.e. to fit the parameters of the classifier
    * In the MLP case, we would use the training set to find the “optimal” weights with the back-prop rule
  * Validation set: a set of examples used to tune the parameters of the learning algorithm, e.g. of a classifier
    * It is also used to compare the performance of the prediction algorithms that were fit on the training set; we choose the algorithm, with its parameters, that performs best
    * In the MLP case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm
  * Test set: a set of examples used only to assess the performance of a fully-trained classifier
    * In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights)
    * After assessing the final model on the test set, YOU MUST NOT tune the model any further!
* Why separate test and validation sets?
  * The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model

Three-way data splits (2)
* Procedure outline
  1. Divide the available data into training, validation and test sets.
  2. Select architecture and training parameters.
  3. Train the model using the training set.
  4. Evaluate the model using the validation set.
  5. Repeat steps 2 through 4 using different architectures and training parameters.
  6. Select the best model and train it using data from the training and validation sets.
  7. Assess this final model using the test set.
* This outline assumes a holdout method
  * If Cross-Validation or Bootstrap are used, steps 3 and 4 have to be repeated for each of the K folds
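The seven steps can be sketched end to end on synthetic data. Everything specific here is an assumption made for illustration: the toy k-nearest-neighbour classifier on scalar inputs stands in for the MLP, the candidate values of k stand in for "architectures and training parameters", and the 60/30/30 split is arbitrary.

```python
import random

def knn_predict(train, x, k):
    """k-nearest-neighbour vote on scalar inputs (toy model, an assumption)."""
    neighbours = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = [y for _, y in neighbours]
    return max(set(votes), key=votes.count)

def error_rate(train, test, k):
    return sum(1 for x, y in test if knn_predict(train, x, k) != y) / len(test)

rng = random.Random(0)
# Synthetic two-class data: class 0 centred at 0.0, class 1 centred at 2.0.
data = [(rng.gauss(2.0 * label, 1.0), label) for label in (0, 1) for _ in range(60)]
rng.shuffle(data)

# Step 1: divide the data into three disjoint sets.
train, val, test = data[:60], data[60:90], data[90:]

# Steps 2-5: for each candidate setting, fit on the training set, evaluate on validation.
candidates = [1, 3, 5, 9]                      # odd k avoids tied votes with two classes
val_errors = {k: error_rate(train, val, k) for k in candidates}
best_k = min(candidates, key=lambda k: val_errors[k])

# Step 6: retrain the selected model on training + validation data
# (for k-NN, "training" simply means storing the examples).
final_train = train + val

# Step 7: assess the final model once on the test set; no further tuning afterwards.
test_error = error_rate(final_train, test, best_k)
print("validation errors:", val_errors)
print("selected k:", best_k, "test error:", test_error)
```

The test set is touched exactly once, after the choice of `best_k` is fixed, so `test_error` is not biased by the selection step.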

Variance-bias tradeoff
* $\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance}$
[Figure: variance-bias tradeoff, training and testing]
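The decomposition can be verified numerically for any set of repeated estimates of a known quantity. The five error-rate estimates and the true value 0.20 below are made-up numbers, and the variance is the population variance (divided by n), which is what makes the identity exact.

```python
def mse_decomposition(estimates, true_value):
    """Return (mse, bias, variance) of a list of estimates of true_value."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n   # population variance
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, bias, variance

estimates = [0.18, 0.22, 0.25, 0.19, 0.21]   # hypothetical error-rate estimates
mse, bias, variance = mse_decomposition(estimates, true_value=0.20)
print("MSE:              ", mse)
print("Bias^2 + Variance:", bias ** 2 + variance)
```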

Resampling methods Cross Validation Bootstrap Bias and variance estimation with the Bootstrap Three-way data partitioning

More Related Content

Similar to Resampling methods Cross Validation Bootstrap Bias and variance estimation with the Bootstrap Three-way data partitioning

More from ssuser1028f8

Recently uploaded

Resampling methods Cross Validation Bootstrap Bias and variance estimation with the Bootstrap Three-way data partitioning