The r^2 score is undefined when applied to a single sample (e.g. leave-one-out CV).

r^2 is not a good metric for evaluating small test sets: on a sufficiently small test set, the score can be far into the negatives despite good predictions.

Given a single sample, a prediction that is good for the domain may appear terrible:

from sklearn.metrics import r2_score

true = [1]
predicted = [1.01]  # prediction of a single value, off by 1%
print(r2_score(true, predicted))  # 0.0
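
To see why, write out the formula: R^2 = 1 - SS_res / SS_tot, where SS_tot measures the spread of the true targets around their mean. With one sample, SS_tot is exactly zero, so the score entails a division by zero. Below is a minimal sketch of that formula (r2_manual is an illustrative helper, not part of sklearn); note that sklearn's handling of this edge case has varied across versions, so you may see 0.0, as above, or a warning instead.

def r2_manual(true, predicted):
    # Textbook R^2: 1 - SS_res / SS_tot
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true)  # 0 when all targets equal their mean
    return 1 - ss_res / ss_tot

try:
    r2_manual([1], [1.01])
except ZeroDivisionError:
    print("undefined: SS_tot is 0 for a single sample")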

Increase the size of the test set (keeping the accuracy of predictions the same), and suddenly the r^2 score appears near-perfect:

true = [1, 2, 3]
predicted = [1.01, 2.02, 3.03]
print(r2_score(true, predicted))  # 0.9993
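
The predictions are still off by exactly 1%; what changed is the denominator. As a rough sketch of the mechanism (continuing from the snippet above), the spread of the targets around their mean inflates SS_tot, which dwarfs SS_res:

mean_true = sum(true) / len(true)                            # 2.0
ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))  # 0.0014
ss_tot = sum((t - mean_true) ** 2 for t in true)             # 2.0
print(1 - ss_res / ss_tot)                                   # 0.9993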

Taken to the other extreme, if the test set is just 2 samples, and we happen to be evaluating 2 samples that are close to each other by chance, this has a substantial impact on the r^2 score, even if the predictions are quite good:

true = [20.2, 20.1]  # actual target values from the Boston Housing dataset
predicted = [19, 21]
print(r2_score(true, predicted))  # -449.0
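
Breaking that score down with the same sketch shows where it comes from: the two targets sit only 0.05 from their shared mean, so SS_tot is tiny, and even modest prediction errors produce a huge ratio:

mean_true = sum(true) / len(true)                            # 20.15
ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))  # 2.25
ss_tot = sum((t - mean_true) ** 2 for t in true)             # 0.005
print(1 - ss_res / ss_tot)                                   # -449.0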
