I was trying the Random Forest algorithm on the Boston dataset to predict the house prices (medv) with sklearn's RandomForestRegressor.
To evaluate how well the model is performing, I tried sklearn's learning_curve with the code below:
```
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes = [1, 25, 50, 100, 200, 390]  # 390 is 80% of shape(X)

def learning_curves(estimator, X, y, train_sizes, cv):
    # Compute training and validation scores for each training-set size
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y, train_sizes=train_sizes, cv=cv,
        scoring='neg_mean_squared_error')

    # Scores are negative MSE, so negate them to get MSE
    train_scores_mean = -train_scores.mean(axis=1)
    print(train_scores_mean)
    validation_scores_mean = -validation_scores.mean(axis=1)
    print(validation_scores_mean)

    # Plot both curves against the training-set size
    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, validation_scores_mean, label='Validation error')
    plt.ylabel('MSE', fontsize=14)
    plt.xlabel('Training set size', fontsize=14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.ylim(0, 40)
```

If you notice, I have passed X, y and not X_train, y_train to learning_curve.
I do not understand whether I should pass X, y or only the training subset X_train, y_train to learning_curve.
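For reference, these are the two ways I could see calling it. This is only a rough sketch; the RandomForestRegressor settings, random_state and cv=5 are placeholder values, not my exact configuration:

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rf = RandomForestRegressor(n_estimators=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Option 1: pass the full dataset; learning_curve makes its own CV splits from X, y
learning_curves(rf, X, y, train_sizes=[1, 25, 50, 100, 200, 390], cv=5)

# Option 2: pass only the training subset and keep X_test, y_test as an untouched hold-out
# (the train_sizes then have to fit inside the CV training folds of X_train)
learning_curves(rf, X_train, y_train, train_sizes=[1, 25, 50, 100, 200], cv=5)
```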
Update 1
Dimensions of my train/test split (75% train, 25% test):

X.shape: (489, 11)
X_train.shape: (366, 11)
X_test.shape: (123, 11)

I had a few additional questions regarding the working of learning_curve.
Does the size of the test dataset vary according to the size of the training dataset as given in the list train_sizes, or is it always fixed (which would be 25% in my case according to the train/test split, i.e. 123 samples)? For example (see the sketch after this list):

- When the train dataset size = 1, will the test data size be 488, or will it be 123 (the size of X_test)?
- When the train dataset size = 25, will the test data size be 464, or will it be 123 (the size of X_test)?
- When the train dataset size = 50, will the test data size be 439, or will it be 123 (the size of X_test)?
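I could presumably probe this by printing what learning_curve itself reports; a minimal sketch below, where rf and cv=5 are placeholder choices rather than my exact setup:

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rf = RandomForestRegressor(n_estimators=100, random_state=42)

sizes_abs, train_scores, validation_scores = learning_curve(
    rf, X, y, train_sizes=[1, 25, 50, 100, 200, 390], cv=5,
    scoring='neg_mean_squared_error')

print(sizes_abs)                # the absolute training-set sizes actually used
print(train_scores.shape)       # (n_train_sizes, n_cv_folds)
print(validation_scores.shape)  # (n_train_sizes, n_cv_folds)

# The returned arrays tell me which training sizes were used, but not the size of the
# test/validation part at each step, which is exactly what I am unsure about.
```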
Update 2
In the blog, the dataset has 9568 observations and the blogger passes the entire dataset X to learning_curve.
train_sizes = [1, 100, 500, 2000, 5000, 7654]
In the first iteration, when train_size is 1, the test_size should be 9567, but then why does he say:
But when tested on the validation set (which has 1914 instances), the MSE rockets up to roughly 423.4.
Shouldn't the test_size be 9567 instead of 1914 for the first iteration?
In the second iteration, when the train_size is 100, shouldn't the test_size be 9468?
What I mean to say is that the test_size will vary according to the train_size; correct me if I am wrong.
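To spell out the arithmetic behind my confusion, here is a tiny sketch; the idea that the blogger used something like a 5-fold split is only my guess, based on the numbers quoted above:

```
n_total = 9568
train_sizes = [1, 100, 500, 2000, 5000, 7654]

# What I expected: the test set is simply "everything not used for training"
expected_if_variable = [n_total - n for n in train_sizes]
print(expected_if_variable)  # [9567, 9468, 9068, 7568, 4568, 1914]

# What the blog reports instead: a validation set of 1914 instances at every train size.
# I notice 9568 / 5 is roughly 1914, so maybe the fixed size comes from the CV folds?
print(round(n_total / 5))    # 1914
```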