I was trying the Random Forest algorithm on the Boston dataset to predict the house prices (medv) with sklearn's RandomForestRegressor.
To evaluate how well the model is performing, I tried sklearn's learning_curve with the code below:
```
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes = [1, 25, 50, 100, 200, 390]  # 390 is 80% of shape(X)

def learning_curves(estimator, X, y, train_sizes, cv):
    # Compute training and validation scores for each training-set size
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, train_sizes=train_sizes, cv=cv,
        scoring='neg_mean_squared_error')

    # Scores are negative MSE, so negate them to get MSE
    train_scores_mean = -train_scores.mean(axis=1)
    print(train_scores_mean)
    validation_scores_mean = -validation_scores.mean(axis=1)
    print(validation_scores_mean)

    # Plot both curves against the training-set size
    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, validation_scores_mean, label='Validation error')
    plt.ylabel('MSE', fontsize=14)
    plt.xlabel('Training set size', fontsize=14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.ylim(0, 40)
```

If you notice, I have passed X, y and not X_train, y_train to learning_curve.
I do not understand whether I should pass X, y or only the training subset X_train, y_train to learning_curve.
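For reference, these are the two ways I could see calling it. This is only a rough sketch; the RandomForestRegressor settings, random_state and cv=5 are placeholder values, not my exact configuration:

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rf = RandomForestRegressor(n_estimators=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Option 1: pass the full dataset; learning_curve makes its own CV splits from X, y
learning_curves(rf, X, y, train_sizes=[1, 25, 50, 100, 200, 390], cv=5)

# Option 2: pass only the training subset and keep X_test, y_test as an untouched hold-out
# (the train_sizes then have to fit inside the CV training folds of X_train)
learning_curves(rf, X_train, y_train, train_sizes=[1, 25, 50, 100, 200], cv=5)
```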
Update 1
Dimensions of my train/test split (75% train, 25% test):

X.shape: (489, 11)
X_train.shape: (366, 11)
X_test.shape: (123, 11)

I had a few additional questions regarding the working of learning_curve.
Does the size of the test dataset vary according to the size of the training dataset as given in the list train_sizes, or is it always fixed (which would be 25% in my case according to the train/test split, i.e. 123 samples)? For example (see the sketch after this list):

- When the train dataset size = 1, will the test data size be 488, or will it be 123 (the size of X_test)?
- When the train dataset size = 25, will the test data size be 464, or will it be 123 (the size of X_test)?
- When the train dataset size = 50, will the test data size be 439, or will it be 123 (the size of X_test)?
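I could presumably probe this by printing what learning_curve itself reports; a minimal sketch below, where rf and cv=5 are placeholder choices rather than my exact setup:

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rf = RandomForestRegressor(n_estimators=100, random_state=42)

sizes_abs, train_scores, validation_scores = learning_curve(
    rf, X, y, train_sizes=[1, 25, 50, 100, 200, 390], cv=5,
    scoring='neg_mean_squared_error')

print(sizes_abs)                # the absolute training-set sizes actually used
print(train_scores.shape)       # (n_train_sizes, n_cv_folds)
print(validation_scores.shape)  # (n_train_sizes, n_cv_folds)

# The returned arrays tell me which training sizes were used, but not the size of the
# test/validation part at each step, which is exactly what I am unsure about.
```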
Update 2
In the blog, the dataset has 9568 observations and the blogger passes the entire dataset X to learning_curve.
train_sizes = [1, 100, 500, 2000, 5000, 7654]
In the first iteration, when train_size is 1, the test_size should be 9567, but then why does he say:
But when tested on the validation set (which has 1914 instances), the MSE rockets up to roughly 423.4.
Shouldn't the test_size be 9567 instead of 1914 for the first iteration?
In the second iteration, when the train_size is 100, shouldn't the test_size be 9468?
What I mean to say is that the test_size will vary according to the train_size; correct me if I am wrong.
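To spell out the arithmetic behind my confusion, here is a tiny sketch; the idea that the blogger used something like a 5-fold split is only my guess, based on the numbers quoted above:

```
n_total = 9568
train_sizes = [1, 100, 500, 2000, 5000, 7654]

# What I expected: the test set is simply "everything not used for training"
expected_if_variable = [n_total - n for n in train_sizes]
print(expected_if_variable)  # [9567, 9468, 9068, 7568, 4568, 1914]

# What the blog reports instead: a validation set of 1914 instances at every train size.
# I notice 9568 / 5 is roughly 1914, so maybe the fixed size comes from the CV folds?
print(round(n_total / 5))    # 1914
```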