
I'm new to machine learning and have spent the last couple of months having a blast using scikit-learn to try to understand the basics of building feature sets and predictive models.

Now I'm trying to use ML on a data set not to predict future values but to understand the importance and direction (positive or negative) of each feature.

My features (X) are boolean and integer values that describe a product. My target (y) is the product's sales. I have ~15,000 observations with 16 features apiece.

With my limited ML knowledge to this point, I'm confident that I can predict (with some level of accuracy) a new y based on a new set of features X. However, I'm struggling to coherently identify, report on, and present the importance and direction of each feature that makes up X.

Thus far, I've taken a two-step approach:

  1. Use a linear regression to observe coefficients
  2. Use a random forest to observe feature importance

The code

First, I try to get the directional impact of each feature:

    # Ordinary least squares: the sign of each coefficient gives the
    # direction of a feature's effect.
    from sklearn import linear_model

    linreg = linear_model.LinearRegression()
    linreg.fit(X, y)
    coef = linreg.coef_
    ...

Second, I try to get the importance of each feature:

    # Random forest: feature_importances_ gives each feature's share of
    # the forest's impurity reduction (magnitude only, no direction).
    from sklearn import ensemble

    forest = ensemble.RandomForestRegressor()
    forest.fit(X, y)
    importance = forest.feature_importances_
    ...

Then I multiply the two derived values together for each feature and end up with some value that maybe perhaps could be the information I'm looking for!
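In code, that combination is just an elementwise product (assuming coef and importance are the arrays from the two snippets above):

    # Elementwise product of linear coefficients and forest importances,
    # one combined score per feature (a heuristic, not a standard method).
    combined = coef * importance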

I'd love to know if I'm on the right track with any of this. Is this a common use case for ML? Are there tools, ideas, packages I should focus on to help guide me?

Thank you very much.


2 Answers


You don't need the linear regression to understand the effect of features in your random forest; you're better off looking at the partial dependence plots directly. These show what you get when you hold all the other variables fixed and vary one at a time. You can plot them using sklearn.ensemble.partial_dependence.plot_partial_dependence. Take a look at the documentation for an example of how to use it.
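That import path reflects the scikit-learn of early 2016; in current releases the same functionality lives in sklearn.inspection. A minimal sketch with the modern API (PartialDependenceDisplay, scikit-learn >= 1.0), using synthetic data as a stand-in for the question's X and y:

    # Sketch: partial dependence plots with a modern scikit-learn.
    # In older releases this lived in sklearn.ensemble.partial_dependence.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import PartialDependenceDisplay

    # Toy stand-in for the question's data: ~15,000 rows, 16 features.
    rng = np.random.RandomState(0)
    X = rng.rand(15000, 16)
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=15000)

    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X, y)

    # One curve per requested feature: the model's average prediction
    # as that feature varies, marginalizing over the others.
    PartialDependenceDisplay.from_estimator(forest, X, features=[0, 1])
    plt.show()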

Another type of model that can be useful for exploratory data analysis is a DecisionTreeClassifier; you can produce a graphical representation of it using export_graphviz.
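A rough sketch of that suggestion. Since the target here is numeric sales, I've swapped in a DecisionTreeRegressor as the natural analogue; export_graphviz comes from sklearn.tree, and X and y are the arrays from the snippet above:

    # Fit a deliberately shallow tree so the diagram stays readable,
    # then export it in Graphviz DOT format.
    from sklearn.tree import DecisionTreeRegressor, export_graphviz

    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Render the file with, e.g., `dot -Tpng tree.dot -o tree.png`.
    export_graphviz(tree, out_file="tree.dot", filled=True, rounded=True)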

  • Max, thanks for the guidance. plot_partial_dependence is really helpful, not just for this, but for future feature selection. Cheers. – Commented Jan 15, 2016

In the past few years, researchers have worked on opening up the "black-box" character of machine learning models by building tools that explain why a chosen model makes the decisions it does. Some implementations include, but are not limited to:

  • SHAP Values (Recommended, and generally recognized as the most complete)
  • LIME
  • eli5
  • sklearn's permutation_importance (in sklearn.inspection)

While each has distinct characteristics, these tools offer insight into "what's happening under the hood" of the model's computation, shedding light on the predictive contribution of each feature.
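As a concrete example, a minimal sketch with the shap package (assumptions: pip install shap, plus a fitted random forest like the one in the question; TreeExplainer handles tree ensembles):

    # SHAP values: one signed contribution per feature per observation,
    # capturing both importance (magnitude) and direction (sign).
    import shap

    explainer = shap.TreeExplainer(forest)
    shap_values = explainer.shap_values(X)

    # Global summary: features ranked by mean |SHAP|, colored by
    # feature value so positive/negative effects are visible.
    shap.summary_plot(shap_values, X)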
