
Is there a way to determine which features are the most relevant for my machine learning model? If I have 20 features, is there a function that will decide which features I should use (or a function that will automatically remove the features that are not relevant)? I plan to do this for a regression or classification model.

My desired output is a list of the most relevant features, and a prediction:

import pandas as pd
from sklearn.linear_model import LinearRegression

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)
variables = df.iloc[:, :-1]
results = df.iloc[:, -1]
print(variables.shape)
print(results.shape)

reg = LinearRegression()
reg.fit(variables, results)
x = reg.predict([[18, 2, 21]])[0]
print(x)

4 Answers


The term you are looking for is feature selection: it consists of identifying which features are the most relevant ones for your analysis. The scikit-learn library has a whole section dedicated to it here.
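For instance, a minimal sketch using SelectKBest with the f_regression scoring function on the DataFrame from your question (k=2 is just an illustrative value, not a recommendation) could look like this:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# keep the k features with the highest univariate regression score
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(X.columns[selector.get_support()])  # names of the selected features
print(selector.scores_)                   # per-feature scores (higher = more relevant)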

Another possibility is to resort to dimensionality reduction techniques, like PCA (Principal Component Analysis) or Random Projections. Each technique has its pros and cons, so much depends on the data you have and the specific application.
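As a rough illustration of the random projection route (with made-up data and an arbitrary n_components, since your example only has three features):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(100, 20)  # e.g. 100 samples, 20 features

# project the 20 features down to 5 random linear combinations
proj = GaussianRandomProjection(n_components=5, random_state=0)
X_reduced = proj.fit_transform(X)
print(X_reduced.shape)  # (100, 5)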


1 Comment

I have read it, but I do not know how to implement it in my code. How do I get a list of the features that are most relevant?

You can access the coef_ attribute of your reg object:

print(reg.coef_) 

It's an oversimplification to call these values weights, since they have a specific meaning in linear regression (they are the fitted coefficients), but they are what you have to work with.
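To see which column each coefficient belongs to, you can zip them with the column names from your DataFrame, for example:

# pair each coefficient with its feature name (variables and reg come from the question's code)
for name, coef in zip(variables.columns, reg.coef_):
    print(name, coef)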



When using a linear model it is important to use linearly independent features. You can inspect the correlations with df.corr():

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

np.random.seed(2)
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
print(df.corr())
out:
            par_1     par_2     par_3   outcome
par_1    1.000000  0.977935  0.191422  0.913878
par_2    0.977935  1.000000  0.193213  0.919307
par_3    0.191422  0.193213  1.000000 -0.158170
outcome  0.913878  0.919307 -0.158170  1.000000

You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to map your features to a lower dimensional space where they are linearly independent:

variables = df.iloc[:, :-1]
results = df.iloc[:, -1]

pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)
print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
 [1.87242048e-16 1.00000000e+00]]

Remember to validate your model on out of sample data:

X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]

pca = PCA(n_components=2)
pca.fit(X_train)
pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)

reg = LinearRegression()
reg.fit(pca_train, y_train)
yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)
print(mean_squared_error(yhat_train, y_train))
print(mean_squared_error(yhat_valid, y_valid))

Feature selection is not trivial: there are a lot of sklearn modules that achieve it (see the docs) and you should always try at least a couple of them and see which one increases performance on out-of-sample data.
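For example, one of those modules is recursive feature elimination (RFE); a minimal sketch reusing the variables and results defined above (n_features_to_select=2 is just an illustrative value) might look like this:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# repeatedly fit the model and drop the weakest feature until 2 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(variables, results)

print(variables.columns[rfe.support_])  # the selected features
print(rfe.ranking_)                     # 1 = selected, larger = eliminated earlier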



Well, initially I faced the same problem. The two methods that I find useful for selecting relevant features are these:

1. You can get the importance of each feature in your dataset by using the feature_importances_ property of the model. Feature importance comes built in with tree-based classifiers.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:, 0:20]  # independent columns
y = data.iloc[:, -1]    # target column, i.e. price range

model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)  # inbuilt feature_importances_ of tree-based classifiers

# plot a graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()


2. Correlation Matrix with Heatmap

Correlation shows how the features are related to each other and to the target variable. It gives an intuition of which features are most strongly correlated with the target.
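A minimal sketch of how such a heatmap is typically produced with seaborn (assuming a DataFrame df that contains both the features and the target column):

import seaborn as sns
import matplotlib.pyplot as plt

# plot the pairwise correlations as a colour-coded grid with the values annotated
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()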


This is not my own research; it comes from this blog post on feature selection, which helped clear up my doubts and I'm sure will do the same for you. :)

