Playing around with the Boston Housing Dataset and RandomForestRegressor (w/ default parameters) in scikit-learn, I noticed something odd: mean cross-validation score decreased as I increased the number of folds beyond 10. My cross-validation strategy was as follows:

    cv_met = ShuffleSplit(n_splits=k, test_size=1/k)
    scores = cross_val_score(est, X, y, cv=cv_met)

... where k was varied. I set test_size to 1/k to mirror the train/test split sizes of k-fold CV. Basically, I wanted something like k-fold CV, but I also needed randomness (hence ShuffleSplit).

This trial was repeated several times, and the average scores and standard deviations were then plotted.

[Plot: area of circle ~ k in k-fold cross-validation]

(Note that the size of k is indicated by the area of the circle; standard deviation is on the Y axis.)

Consistently, increasing k (from 2 to 44) would yield a brief increase in score, followed by a steady decrease as k increased further (beyond ~10 folds)! If anything, I would expect more training data to lead to a minor increase in score!
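
For reference, a single trial looks roughly like this (plotting details simplified; the repetition over several trials is omitted):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    boston = load_boston()
    X, y = boston.data, boston.target
    est = RandomForestRegressor()

    ks = range(2, 45)
    means, stds = [], []
    for k in ks:
        # k random splits, each holding out ~1/k of the data (mimics k-fold split sizes)
        cv_met = ShuffleSplit(n_splits=k, test_size=1/k)
        scores = cross_val_score(est, X, y, cv=cv_met)  # default scoring for a regressor: r2
        means.append(scores.mean())
        stds.append(scores.std())

    # Mean score against standard deviation, with circle area proportional to k
    plt.scatter(means, stds, s=[10 * k for k in ks])
    plt.xlabel("mean cross-validation score (r2)")
    plt.ylabel("standard deviation of scores")
    plt.show()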

Update

As k approaches n (leave-one-out CV), I get a cross-validation score of **0**. Here's a quick bit of sample code that will hopefully elicit the underlying problem (which I still haven't uncovered):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import LeaveOneOut
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import load_boston

    boston = load_boston()
    score = cross_val_score(RandomForestRegressor(), boston.data, boston.target,
                            cv=LeaveOneOut())
    print(score.mean())  # 0.0

Update

Changing the scoring criterion to mean absolute error results in the behavior I'd expect: the score improves with an increasing number of folds in k-fold CV, rather than approaching 0 (as with the default, r2). The question remains why the default scoring metric results in poor performance *across both the mean and the standard deviation* for an increasing number of folds.
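
For reference, a sketch of the same check with the MAE scorer (in recent scikit-learn versions the scorer is named 'neg_mean_absolute_error' and returns negated errors, so higher is better; older versions called it 'mean_absolute_error'):

    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    boston = load_boston()
    k = 20  # arbitrary example value
    cv_met = ShuffleSplit(n_splits=k, test_size=1/k)
    scores = cross_val_score(RandomForestRegressor(), boston.data, boston.target,
                             cv=cv_met, scoring='neg_mean_absolute_error')
    print(-scores.mean())  # mean absolute error, averaged over the k splits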