I am building a real-time machine learning module, which is not based on a huge** sample size, with a hyperparameter grid search and cross-validation process. I'm looking for any insight/advice, as I'm considering one of these options:
1. Use cross-validated grid search to find the best hyperparameter (HP) combination, and once I have found it, use it to retrain my classifier on the whole sample set.
2. Split my training set into two subsets in advance, run cross-validation by iteratively re-splitting it into train/test sets while searching for the best HP combination, and then use the classifier already trained with the best HP, skipping the final retraining stage described in 1.
3. Do the same as 1, but keep the same random seed when I retrain the classifier.
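For concreteness, option 1 can be sketched in scikit-learn (an assumption on my stack; the parameter grid values below are purely illustrative):

```python
# Sketch of option 1: cross-validated grid search, then retrain the
# best HP combination on the whole sample set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for my real samples (few hundred per class,
# ~100 features).
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# Illustrative grid, not a recommendation.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", "log2"],
}

# refit=True (the default) retrains the best estimator on all of X, y
# after the search -- exactly the final retraining step of option 1.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    refit=True,
)
search.fit(X, y)

print(search.best_params_)       # best HP combination found by CV
clf = search.best_estimator_     # classifier refit on the full set
```

Option 2 would instead skip the `refit` step and reuse one of the fold-trained models directly.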
The trade-off, as I see it, is between gaining a bit more sample size for training and losing the assurance that performance will hold up and that I'm not overfitting.
Again, any thoughts/insights on my dilemma are welcome.
*Note that I'm using random-forest and extra-trees classifiers during grid search.
**My sample size is typically between a few hundred and a few thousand per class, and the number of features is typically between 70 and 1500.