Why is a large choice of K lowering my cross validation score?

Question

Playing around with the Boston Housing Dataset and RandomForestRegressor (w/ default parameters) in scikit-learn, I noticed something odd: mean cross-validation score decreased as I increased the number of folds beyond 10. My cross-validation strategy was as follows:

cv_met = ShuffleSplit(n_splits=k, test_size=1/k) scores = cross_val_score(est, X, y, cv=cv_met)

... where num_cvs was varied. I set test_size to 1/num_cvs to mirror the train/test split size behavior of k-fold CV. Basically, I wanted something like k-fold CV, but I also needed randomness (hence ShuffleSplit).

This trial was repeated several times, and avg scores and standard deviations were then plotted.

(Note that the size of k is indicated by the area of the circle; standard deviation is on the Y axis.)

Consistently, increasing k (from 2 to 44) would yield a brief increase in score, followed by a steady decrease as k increased further (beyond ~10 folds)! If anything, I would expect more training data to lead to a minor increase in score!

Update

Changing the scoring criteria to mean absolute error results in behavior that I'd expect: scoring improves with an increased number of folds in K-fold CV, rather than approaching 0 (as with the default, 'r2'). The question remains why the default scoring metric results in poor performance across both mean and STD metrics for an increasing number of folds.

Any duplicate records in your folds? This might be due to overfitting. — Has QUIT--Anony-Mousse
– Has QUIT--Anony-Mousse, Commented Oct 3, 2016 at 16:54
@Anony-Mousse No, since the Boston Housing dataset does not have duplicate records and ShuffleSplit's sampling does not cause duplicate records. — Brian Bien
– Brian Bien, Commented Oct 3, 2016 at 17:31
Also, improve your plotting. Use error bars, to show mean, +- stddev, and min/max. Put k on the other axis. — Has QUIT--Anony-Mousse
– Has QUIT--Anony-Mousse, Commented Oct 3, 2016 at 18:28
I don't think that more training examples increase the chance of overfitting. I plotted a learning curve with this dataset, again using ShuffleSplit (n_splits=300 with various test sizes) and saw consistently increased accuracy as more training examples were made available. — Brian Bien
– Brian Bien, Commented Oct 4, 2016 at 5:20
sorry, you are right, more is better and best is 1. But you don't have this issue if you use mean squared or absolute error. So it has to do something with the error term — rep_ho
– rep_ho, Commented Oct 7, 2016 at 11:52

Brian Bien · Accepted Answer · 2016-10-08 20:31:31Z

r^2 score is undefined when applied to a single sample (e.g. leave-one-out CV).

r^2 is not good for evaluation of small test sets: when it's used to evaluate a sufficiently-small test set, the score can be far into the negatives despite good predictions.

Given a single sample, a good prediction for a given domain may appear terrible:

from sklearn.metrics import r2_score true = [1] predicted = [1.01] # prediction of a single value, off by 1% print(r2_score(true, predicted)) # 0.0

Increase the size of the test set (keeping the accuracy of predictions the same), and suddenly the r^2 score appears near-perfect:

true = [1, 2, 3] predicted = [1.01, 2.02, 3.03] print(r2_score(true, predicted)) # 0.9993

Taken to the other extreme, if the test size is 2 samples, and we happen to be evaluating 2 samples that are close to each other by chance, this will have substantial impact on the r^2 score, even if the predictions are quite good:

true = [20.2, 20.1] # actual target values from the Boston Housing dataset predicted = [19, 21] print(r2_score(true, predicted)) # -449.0

This is a great explanation. I'm surprised the issue of r^2 score inaccuracy on small validation sets is not discussed anywhere in the scikit-learn documentation. — Inquisitive
– Inquisitive, Commented Jan 8 at 5:48

Stack Exchange Network

Why is a large choice of K lowering my cross validation score?

Update

1 Answer 1

Hot Network Questions

Why is a large choice of K lowering my cross validation score?

Update

1 Answer 1

Related

Hot Network Questions