I am doing a random forest regression on my dataset (which has about 15 input features and 1 target feature). I am getting a reasonably high R^2 (though below 1) for both the train and test sets (please do let me know if these scores are not good enough).

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # load dataset
    df = pd.read_csv('Dataset.csv')

    # split into input (X) and output (Y) variables
    X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)
    Y = df.TARGET_COLUMN

    # Split the data into 67% for training and 33% for testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

    # Fitting the regression model to the dataset
    regressor = RandomForestRegressor(n_estimators=100, random_state=50)
    # Using ravel() on the underlying array to avoid getting a 'DataConversionWarning' warning message
    regressor.fit(X_train, Y_train.values.ravel())

    print("Predicting Values:")
    y_pred = regressor.predict(X_test)

    print("Getting Model Performance...")
    # Get regression scores
    print("R^2 train = ", regressor.score(X_train, Y_train))
    print("R^2 test = ", regressor.score(X_test, Y_test))

This outputs the following:

    Predicting Values:
    Getting Model Performance...
    R^2 train = 0.9791000275450427
    R^2 test = 0.8577464692386905

Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:

    diff = []
    for i in range(len(y_pred)):
        # a few values were 0, which was causing the corresponding diff value to become inf
        if Y_test.values[i] != 0:
            # element-wise percentage error
            diff.append(100 * np.abs(y_pred[i] - Y_test.values[i]) / Y_test.values[i])

I found that the majority of the element-wise differences were between 40% and 60%, and their mean was almost 50%!

    np.mean(diff)
    >>> 49.07580695857447

So, which one is correct? Is the regression score correct and my model is good for this data, or is the element-wise error I calculated correct and the model didn't do well for this data? If it's the latter, please advise on how to increase the prediction accuracy.
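The two numbers measure different things, so they need not agree. R^2 compares squared errors with the overall spread of the target, while the element-wise percentage error divides each error by that row's own target value, so rows with small targets inflate it. A minimal sketch with made-up numbers (not my data; the helper names `r_squared` and `mean_abs_pct_error` are hypothetical, and unlike my loop above the second one divides by the absolute target value) shows both behaviours at once:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mean_abs_pct_error(y_true, y_pred):
    """Mean of 100 * |error| / |target|, skipping rows whose target is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    pct = 100.0 * np.abs(y_pred[nonzero] - y_true[nonzero]) / np.abs(y_true[nonzero])
    return float(np.mean(pct))


# Illustrative values only: two small targets and two large ones.
y_true_demo = np.array([1.0, 2.0, 50.0, 100.0])
y_pred_demo = np.array([2.0, 1.0, 52.0, 98.0])

r2_demo = r_squared(y_true_demo, y_pred_demo)
mape_demo = mean_abs_pct_error(y_true_demo, y_pred_demo)
print(f"R^2  = {r2_demo:.4f}")
print(f"MAPE = {mape_demo:.1f}%")
```

In this toy case every absolute error is at most 2 and R^2 is above 0.99, yet the mean percentage error is 39%, because the two rows with targets of 1 and 2 contribute 100% and 50%. Whether something similar is happening in my data would depend on how many test targets are close to zero.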
I also checked the RMSE:

    import math
    rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))
    rmse
    >>> 3.67328471827293

This seems quite high for the model to have done a good job, but please correct me if I'm wrong.
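Whether an RMSE of about 3.67 is high depends on the spread of the target. For a fixed test set, R^2 = 1 - MSE / Var(y), so the reported RMSE and test R^2 together imply a target standard deviation. A worked sketch using the two figures quoted above, assuming both came from the same train/test split (the split is not seeded, so this may not hold exactly):

```python
import math

# Figures quoted above; assumed to come from the same test split.
rmse = 3.67328471827293
r2_test = 0.8577464692386905

mse = rmse ** 2
# Rearranging R^2 = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R^2).
implied_variance = mse / (1.0 - r2_test)
implied_std = math.sqrt(implied_variance)

# RMSE as a fraction of the target's standard deviation; equals sqrt(1 - R^2).
rmse_over_std = rmse / implied_std

print(f"implied target std = {implied_std:.2f}")
print(f"RMSE / target std  = {rmse_over_std:.2f}")
```

Under that assumption the target's standard deviation is roughly 9.7, so the RMSE is a bit under 40% of the spread, which is just another way of stating an R^2 of about 0.86.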
I also checked the R^2 scores for different numbers of estimators:

    import matplotlib.pyplot as plt

    model = RandomForestRegressor(n_jobs=-1)

    # Try different numbers of n_estimators
    estimators = np.arange(10, 200, 10)
    scores = []
    for n in estimators:
        model.set_params(n_estimators=n)
        model.fit(X_train, Y_train)
        scores.append(model.score(X_test, Y_test))

    plt.title("Effect of n_estimators")
    plt.xlabel("n_estimator")
    plt.ylabel("score")
    plt.plot(estimators, scores)

Please advise.
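Each score in the loop above comes from a single train/test split and an unseeded forest, so small differences between `n_estimators` values may just be noise. One way to check, sketched here on synthetic data from `make_regression` as a stand-in for my dataset (the 15 numeric features and the noise level are assumptions), is to report a cross-validated R^2 with its spread across folds:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption: 15 numeric features).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=15, noise=10.0, random_state=0
)

cv_results = {}
for n in (10, 50, 100):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    # 5-fold cross-validated R^2: one score per held-out fold.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_results[n] = (fold_scores.mean(), fold_scores.std())
    print(f"n_estimators={n:3d}  R^2 = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the change between two `n_estimators` settings, the curve plotted from a single split probably should not be read as a real trend.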
I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    lr = LinearRegression()
    lr.fit(X_train, Y_train)
    y_pred = lr.predict(X_test)

    # The coefficients
    print('Coefficients: \n', lr.coef_)
    # The mean squared error
    print("Mean squared error: %.2f" % mean_squared_error(Y_test, y_pred))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % r2_score(Y_test, y_pred))

This outputs the following:

    Coefficients:
     [ 1.93829229e-01 -4.68738825e-01 2.01635420e-01 6.35902010e-01
       6.57354434e-03 5.13180293e-03 2.84015810e-01 -1.31469084e-06
       1.95335035e+00]
    Mean squared error: 86.92
    Variance score: 0.08