
I'm new to machine learning and have spent the last couple of months having a blast using scikit-learn to try to understand the basics of building feature sets and predictive models.

Now I'm trying to use ML on a data set not to predict future values but to understand the importance and direction (positive or negative) of each feature.

My features (X) are boolean and integer values that describe a product. My target (y) is the product's sales. I have ~15,000 observations with 16 features apiece.

With my limited ML knowledge to this point, I'm confident that I can predict (with some level of accuracy) a new y based on a new set of features X. However, I'm struggling to coherently identify, report on, and present the importance and direction of each feature that makes up X.

Thus far, I've taken a two-step approach:

  1. Use a linear regression to observe coefficients
  2. Use a random forest to observe feature importance

The code

First, I try to get the directional impact of each feature:

    # Ordinary least squares: the sign of each coefficient gives the
    # direction of a feature's effect.
    from sklearn import linear_model

    linreg = linear_model.LinearRegression()
    linreg.fit(X, y)
    coef = linreg.coef_
    ...

Second, I try to get the importance of each feature:

    # Random forest: feature_importances_ gives each feature's share of
    # the forest's impurity reduction (magnitude only, no direction).
    from sklearn import ensemble

    forest = ensemble.RandomForestRegressor()
    forest.fit(X, y)
    importance = forest.feature_importances_
    ...

Then I multiply the two derived values together for each feature and end up with some value that maybe perhaps could be the information I'm looking for!
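In code, that combination is just an elementwise product (assuming coef and importance are the arrays from the two snippets above):

    # Elementwise product of linear coefficients and forest importances,
    # one combined score per feature (a heuristic, not a standard method).
    combined = coef * importance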

I'd love to know if I'm on the right track with any of this. Is this a common use case for ML? Are there tools, ideas, packages I should focus on to help guide me?

Thank you very much.


2 Answers


You don't need the linear regression to understand the effect of features in your random forest; you're better off looking at the partial dependence plots directly. These show what you get when you hold all the other variables fixed and vary one at a time. You can plot them using sklearn.ensemble.partial_dependence.plot_partial_dependence. Take a look at the documentation for an example of how to use it.
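That import path reflects the scikit-learn of early 2016; in current releases the same functionality lives in sklearn.inspection. A minimal sketch with the modern API (PartialDependenceDisplay, scikit-learn >= 1.0), using synthetic data as a stand-in for the question's X and y:

    # Sketch: partial dependence plots with a modern scikit-learn.
    # In older releases this lived in sklearn.ensemble.partial_dependence.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import PartialDependenceDisplay

    # Toy stand-in for the question's data: ~15,000 rows, 16 features.
    rng = np.random.RandomState(0)
    X = rng.rand(15000, 16)
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=15000)

    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X, y)

    # One curve per requested feature: the model's average prediction
    # as that feature varies, marginalizing over the others.
    PartialDependenceDisplay.from_estimator(forest, X, features=[0, 1])
    plt.show()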

Another type of model that can be useful for exploratory data analysis is a DecisionTreeClassifier; you can produce a graphical representation of it using export_graphviz.
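A rough sketch of that suggestion. Since the target here is numeric sales, I've swapped in a DecisionTreeRegressor as the natural analogue; export_graphviz comes from sklearn.tree, and X and y are the arrays from the snippet above:

    # Fit a deliberately shallow tree so the diagram stays readable,
    # then export it in Graphviz DOT format.
    from sklearn.tree import DecisionTreeRegressor, export_graphviz

    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Render the file with, e.g., `dot -Tpng tree.dot -o tree.png`.
    export_graphviz(tree, out_file="tree.dot", filled=True, rounded=True)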

  • Max, thanks for the guidance. plot_partial_dependence is really helpful, not just for this, but for future feature selection. Cheers. – Commented Jan 15, 2016

In the past few years, researchers have worked on opening up the "black-box" character of machine learning models by building tools that explain why a chosen model makes the decisions it does. Some implementations include, but are not limited to:

  • SHAP Values (Recommended, and generally recognized as the most complete)
  • LIME
  • eli5
  • sklearn's permutation_importance (in sklearn.inspection)

While each has distinct characteristics, these tools offer insight into "what's happening under the hood" of the model's computation, shedding light on the predictive contribution of each feature.
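As a concrete example, a minimal sketch with the shap package (assumptions: pip install shap, plus a fitted random forest like the one in the question; TreeExplainer handles tree ensembles):

    # SHAP values: one signed contribution per feature per observation,
    # capturing both importance (magnitude) and direction (sign).
    import shap

    explainer = shap.TreeExplainer(forest)
    shap_values = explainer.shap_values(X)

    # Global summary: features ranked by mean |SHAP|, colored by
    # feature value so positive/negative effects are visible.
    shap.summary_plot(shap_values, X)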
