
I'm trying to explore the use of a GBM with h2o for a classification issue to replace a logistic regression (GLM). The non-linearity and interactions in my data make me think a GBM is more suitable.

I've run a baseline GBM (see below) and compared its AUC against the AUC of the logistic regression. The GBM performs much better.

In a classic linear logistic regression, one would be able to see the direction and effect of each of the predictors (x) on the outcome variable (y).

Now, I would like to evaluate the variable importance of the estimated GBM in the same way.

How does one obtain the variable importance for each of the (two) classes?

I know that the variable importance is not the same as the estimated coefficient in a logistic regression, but it would help me to understand which predictor impacts what class.

Others have asked similar questions, but the answers provided won't work for the H2O object.

Any help is much appreciated.

example.gbm <- h2o.gbm(x = c("list of predictors"),
                       y = "binary response variable",
                       training_frame = data,
                       max_runtime_secs = 1800,
                       nfolds = 5,
                       stopping_metric = "AUC")
  • have you tried: h2o.varimp(model)? Commented Dec 2, 2017 at 18:28
  • Yes, but that command gives the variable importance for both classes. Commented Dec 2, 2017 at 18:29
  • What is it that you are referring to as 'linear logistic regression'? Could you elaborate further, with an example, on what you mean by 'variable importance for both classes', and why it would not be the same set of variable importances for predicting both classes? Commented Dec 13, 2017 at 11:40
  • @GangeshDubey with 'linear logistic regression' in this case I simply refer to a regression for a binary variable. With respect to the 'variable importance for both classes' see the link above. Commented Dec 13, 2017 at 11:43
  • Thanks. I looked at the documentation for both h2o.gbm and h2o.grid; there appears to be no direct method to achieve it. In fact, I had a look at the source code, and you can verify that h2o.varimp returns a single set of importances. Commented Dec 13, 2017 at 14:16

2 Answers


As far as I see, the more powerful a machine learning method is, the harder it is to explain what is going on beneath it.

The advantages of the GBM method (as you mentioned already) also make the model harder to understand. This is especially true for numeric variables, since a GBM may treat different value ranges of the same variable differently: some ranges may have a positive impact on the prediction while others have a negative effect.

For a GLM with no interactions specified, the effect of a numeric variable is monotonic, so you can examine whether its impact is positive or negative.

Since a complete view of the model is difficult, what methods can we use to analyse it? There are two we can start with:

Partial Dependence Plot

h2o provides h2o.partialPlot, which gives the partial (i.e. marginal) effect of each variable, which can be read as its effect on the response:

library(h2o)
h2o.init()

prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path, destination_frame = "prostate.hex")
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])

prostate.gbm <- h2o.gbm(x = c("AGE", "RACE"), y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10, max_depth = 5, learn_rate = 0.1)

h2o.partialPlot(object = prostate.gbm, data = prostate.hex, cols = "AGE")

(figure: partial dependence plot for AGE)

Individual Analyser

The LIME package [https://github.com/thomasp85/lime] provides the capability to check each variable's contribution for each individual observation. Luckily, this R package already supports h2o.
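As a minimal sketch of how that might look (assuming the prostate.gbm model and prostate.hex frame from the partial-dependence example above; exact column choices are illustrative):

```r
library(lime)

# lime works on a regular data.frame, so convert the H2O frame first
prostate.df <- as.data.frame(prostate.hex)

# Build an explainer around the trained H2O model
explainer <- lime(prostate.df[, c("AGE", "RACE")], prostate.gbm)

# Explain a few individual observations: for each one, lime estimates
# how much each variable pushed the prediction towards its class
explanation <- explain(prostate.df[1:4, c("AGE", "RACE")],
                       explainer, n_labels = 1, n_features = 2)

plot_features(explanation)
```

The resulting plot shows, per observation, which variables supported or contradicted the predicted class, which is about as close as you can get to per-class, per-predictor direction for a GBM.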

(figure: LIME feature contributions for individual observations)




You can try h2o.varimp(object)
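For instance, with the example.gbm model from the question (note this returns a single overall importance table, not one per class):

```r
# Table of relative, scaled, and percentage importance per variable
h2o.varimp(example.gbm)

# Bar chart of the same importances
h2o.varimp_plot(example.gbm)
```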

