
I am using the learner regr.gbm to predict counts. Outside of mlr, using the gbm package directly, I set distribution = "poisson", and predict.gbm with type = "response" returns predictions on the original scale. However, when I do the same through mlr, the predictions appear to be on the log scale:

         truth    response
    913      4  0.67348708
    914      1  0.28413256
    915      3  0.41871237
    916      1  0.13027792
    2101     1 -0.02092168
    2102     2  0.23394970

However, the "truth" is not on the log scale and so I am concerned that the hyper-parameter tuning routines in mlr will not work. For comparison, this is the output I get with distribution = "gaussian".

         truth response
    913      4 2.028177
    914      1 1.334658
    915      3 1.552846
    916      1 1.153072
    2101     1 1.006362
    2102     2 1.281811

What is the best way to handle this?

  • mlr doesn't do any processing of the predictions returned by gbm -- can you post a complete example that demonstrates the problem please? Commented Oct 31, 2018 at 17:18

1 Answer


This happens because gbm by default makes predictions on the link-function scale (which is log for distribution = "poisson"). This is governed by the type parameter of gbm::predict.gbm (see that function's help page). Unfortunately, mlr does not expose this parameter by default (this has been reported in the mlr bug tracker). A workaround for now is to add the parameter by hand:

    lrn <- makeLearner("regr.gbm", distribution = "poisson")
    lrn$par.set <- c(lrn$par.set, makeParamSet(
      makeDiscreteLearnerParam("type", c("link", "response"),
        default = "link", when = "predict", tunable = FALSE)))
    lrn <- setHyperPars(lrn, type = "response")

    # show that it works:
    counttask <- makeRegrTask("counttask", getTaskData(pid.task), target = "pregnant")
    pred <- predict(train(lrn, counttask), counttask)
    pred

Be aware that when tuning parameters on count data, the default regression measure (mean squared error) may overemphasize fit for data points with large count values. The squared error for predicting 10 instead of 1 is the same as for predicting 1010 instead of 1001, but depending on your objective you probably want to put more weight on the first error.
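To make this concrete, here is a quick sketch in plain R (no mlr required): both mistakes cost the same on the squared-error scale, while on the log scale the first one is far more expensive.

```r
# Squared error penalizes both mistakes equally:
(10 - 1)^2       # 81
(1010 - 1001)^2  # 81

# On the log scale, the first prediction's error is much larger:
(log(10) - log(1))^2       # about 5.3
(log(1010) - log(1001))^2  # about 0.00008
```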

A possible solution is to use (normalized) mean Poisson log likelihood as measure:

    poisllmeasure = makeMeasure(
      id = "poissonllnorm",
      minimize = FALSE, best = 0, worst = -Inf,
      properties = "regr",
      name = "Mean Poisson Log Likelihood",
      note = "For count data. Normalized to 0 for perfect fit.",
      fun = function(task, model, pred, feats, extra.args) {
        mean(dpois(pred$data$truth, pred$data$response, log = TRUE) -
             dpois(pred$data$truth, pred$data$truth, log = TRUE))
      })

    # example
    performance(pred, poisllmeasure)

This measure can be used for tuning by passing it to the measures parameter of tuneParams(). (Note that you will have to wrap it in a list: tuneParams(..., measures = list(poisllmeasure), ...).)
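A minimal tuning sketch, assuming the lrn, counttask, and poisllmeasure objects defined above are available; the tuned parameters and their ranges here are arbitrary illustrations, not recommendations:

```r
# Hypothetical search space over two common gbm parameters
# (n.trees and shrinkage are part of regr.gbm's parameter set):
ps <- makeParamSet(
  makeIntegerParam("n.trees", lower = 50, upper = 500),
  makeNumericParam("shrinkage", lower = 0.01, upper = 0.3))

res <- tuneParams(lrn, counttask,
  resampling = makeResampleDesc("CV", iters = 3),
  measures = list(poisllmeasure),   # note: wrapped in a list
  par.set = ps,
  control = makeTuneControlRandom(maxit = 10))

res$x  # best parameter setting found
```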
