When using a linear model, it is important that the features are (close to) linearly independent. You can inspect the correlations with df.corr():
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

np.random.seed(2)

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)

print(df.corr())
out:
            par_1     par_2     par_3   outcome
par_1    1.000000  0.977935  0.191422  0.913878
par_2    0.977935  1.000000  0.193213  0.919307
par_3    0.191422  0.193213  1.000000 -0.158170
outcome  0.913878  0.919307 -0.158170  1.000000
You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to project your features into a lower-dimensional space where the resulting components are uncorrelated (and hence linearly independent):
variables = df.iloc[:, :-1]
results = df.iloc[:, -1]

pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)

print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
 [1.87242048e-16 1.00000000e+00]]
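If you also want to check how much of the original variance those two components retain (not shown in the output above), the fitted PCA object exposes explained_variance_ratio_:

print(pca.explained_variance_ratio_)        # variance fraction per component
print(pca.explained_variance_ratio_.sum())  # total variance kept by the 2 components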
Remember to validate your model on out-of-sample data:
X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]

# Fit the PCA on the training rows only, then apply the same
# transformation to the validation rows (no leakage).
pca = PCA(n_components=2)
pca.fit(X_train)
pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)

reg = LinearRegression()
reg.fit(pca_train, y_train)
yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)

print(mean_squared_error(y_train, yhat_train))
print(mean_squared_error(y_valid, yhat_valid))
Feature selection is not trivial: there are many sklearn modules that achieve it (see the docs), and you should always try at least a couple of them and check which one improves performance on out-of-sample data.
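As one possible starting point, here is a minimal sketch using SelectKBest with an f_regression score; k=2 is an arbitrary choice for this toy data, and other selectors in sklearn.feature_selection may work better for your problem:

from sklearn.feature_selection import SelectKBest, f_regression

# Score each feature against the outcome and keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=2)
selected = selector.fit_transform(variables, results)

print(selector.scores_)                           # per-feature F-statistics
print(variables.columns[selector.get_support()])  # names of the kept columns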