Playing around with the Boston Housing Dataset and RandomForestRegressor (with default parameters) in scikit-learn, I noticed something odd: the mean cross-validation score decreased as I increased the number of folds beyond 10. My cross-validation strategy was as follows:
    cv_met = ShuffleSplit(n_splits=k, test_size=1/k)
    scores = cross_val_score(est, X, y, cv=cv_met)

...where k was varied. I set test_size to 1/k to mirror the train/test split sizes of k-fold CV. Basically, I wanted something like k-fold CV, but I also needed randomness (hence ShuffleSplit).
This trial was repeated several times, and the average scores and standard deviations were then plotted.
(Note that the size of k is indicated by the area of each circle; standard deviation is on the y-axis.)
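In case it helps, here's a minimal sketch of the experiment loop. Two caveats: fetch_california_housing (subsampled to 506 rows to match Boston's size) stands in for load_boston, which was removed in scikit-learn 1.2, and the number of repeated trials here is arbitrary, just for illustration.

    import numpy as np
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    # Stand-in data: load_boston was removed in scikit-learn 1.2.
    X, y = fetch_california_housing(return_X_y=True)
    X, y = X[:506], y[:506]  # subsample to Boston's 506 rows

    est = RandomForestRegressor(random_state=0)  # default parameters otherwise
    n_trials = 5  # arbitrary number of repeated trials

    for k in (2, 5, 10, 20, 44):
        means, stds = [], []
        for seed in range(n_trials):
            cv_met = ShuffleSplit(n_splits=k, test_size=1/k, random_state=seed)
            scores = cross_val_score(est, X, y, cv=cv_met)  # default scorer: R^2
            means.append(scores.mean())
            stds.append(scores.std())
        print(f"k={k:2d}  avg score={np.mean(means):.3f}  avg std={np.mean(stds):.3f}")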
Consistently, increasing k (from 2 to 44) yielded a brief increase in score, followed by a steady decrease as k grew further (beyond ~10 folds)! If anything, I would have expected more training data to yield a minor increase in score!
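For scale (not an answer, just arithmetic on the setup): with test_size=1/k on Boston's 506 rows, each split is scored on roughly 506/k points, so the scoring set shrinks quickly as k grows:

    import math

    n_samples = 506  # rows in the Boston Housing Dataset
    for k in (2, 5, 10, 20, 44):
        # approximate; exact rounding depends on ShuffleSplit internals
        print(f"k={k:2d} -> ~{math.ceil(n_samples / k)} test samples per split")

At k=44, each score comes from only about a dozen samples.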
