The r^2 score is undefined when applied to a single sample (e.g. leave-one-out CV).

More generally, r^2 is a poor metric for small test sets: evaluated on a sufficiently small test set, the score can be far into the negatives despite good predictions.
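Both behaviours fall straight out of the definition sklearn uses, r^2 = 1 - SS_res / SS_tot: with one sample SS_tot is exactly zero, and with a handful of clustered samples it is tiny. Here is a minimal hand-rolled version to make the arithmetic visible (the name `r2_manual` is just for illustration):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Illustrative re-implementation: r^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    # With one sample, ss_tot == 0 and the division is undefined;
    # with a few clustered samples, ss_tot is tiny and r^2 blows up.
    return 1 - ss_res / ss_tot
```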
Given a single sample, even a prediction that is good for the domain can appear terrible:
```python
from sklearn.metrics import r2_score

true = [1]
predicted = [1.01]  # prediction of a single value, off by 1%
print(r2_score(true, predicted))  # 0.0
```

Increase the size of the test set (keeping the accuracy of the predictions the same), and suddenly the r^2 score appears near-perfect:
```python
true = [1, 2, 3]
predicted = [1.01, 2.02, 3.03]
print(r2_score(true, predicted))  # 0.9993
```

Taken to the other extreme, if the test set holds just 2 samples that happen to lie close to each other by chance, the r^2 score is badly distorted even when the predictions are quite good:
```python
true = [20.2, 20.1]  # actual target values from the Boston Housing dataset
predicted = [19, 21]
print(r2_score(true, predicted))  # -449.0
```
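Working the formula by hand shows where the huge negative comes from: the two targets differ by only 0.1, so SS_tot is tiny and the modest residuals dwarf it:

```python
import numpy as np

true = np.array([20.2, 20.1])
predicted = np.array([19.0, 21.0])

ss_res = np.sum((true - predicted) ** 2)    # (1.2)^2 + (-0.9)^2 = 2.25
ss_tot = np.sum((true - true.mean()) ** 2)  # (0.05)^2 + (-0.05)^2 = 0.005
print(1 - ss_res / ss_tot)                  # 2.25 / 0.005 = 450, so r^2 = -449.0
```

A near-constant set of targets drives SS_tot toward zero, so any residual at all pushes r^2 toward minus infinity.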