
I am using the learner regr.gbm to predict counts. Outside of mlr, using the gbm package directly, I set distribution = "poisson", and predict.gbm with type = "response" returns predictions on the original scale. However, when I do the same through mlr, the predictions appear to be on the log scale:

         truth    response
    913      4  0.67348708
    914      1  0.28413256
    915      3  0.41871237
    916      1  0.13027792
    2101     1 -0.02092168
    2102     2  0.23394970

However, the "truth" is not on the log scale and so I am concerned that the hyper-parameter tuning routines in mlr will not work. For comparison, this is the output I get with distribution = "gaussian".

         truth response
    913      4 2.028177
    914      1 1.334658
    915      3 1.552846
    916      1 1.153072
    2101     1 1.006362
    2102     2 1.281811

What is the best way to handle this?

  • mlr doesn't do any processing of the predictions returned by gbm -- can you post a complete example that demonstrates the problem please? Commented Oct 31, 2018 at 17:18

1 Answer


This happens because gbm by default makes predictions on the link-function scale (which is log for distribution = "poisson"). This is governed by the type parameter of gbm::predict.gbm (see that function's help page). Unfortunately, mlr does not expose this parameter by default (this has been reported in the mlr bug tracker). A workaround for now is to add the parameter by hand:

    lrn <- makeLearner("regr.gbm", distribution = "poisson")
    lrn$par.set <- c(lrn$par.set, makeParamSet(
      makeDiscreteLearnerParam("type", c("link", "response"),
        default = "link", when = "predict", tunable = FALSE)))
    lrn <- setHyperPars(lrn, type = "response")

    # show that it works:
    counttask <- makeRegrTask("counttask", getTaskData(pid.task), target = "pregnant")
    pred <- predict(train(lrn, counttask), counttask)
    pred

Be aware that when tuning parameters on count data, the default regression measure (mean squared error) may overemphasize fit for data points with large count values. The squared error for predicting 10 instead of 1 is the same as for predicting 1010 instead of 1001, but depending on your objective you probably want to put more weight on the first error.
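To make this concrete, here is a quick sketch in plain R (no mlr required): both mistakes cost the same on the squared-error scale, while on the log scale the first one is far more expensive.

```r
# Squared error penalizes both mistakes equally:
(10 - 1)^2       # 81
(1010 - 1001)^2  # 81

# On the log scale, the first prediction's error is much larger:
(log(10) - log(1))^2       # about 5.3
(log(1010) - log(1001))^2  # about 0.00008
```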

A possible solution is to use (normalized) mean Poisson log likelihood as measure:

    poisllmeasure = makeMeasure(
      id = "poissonllnorm",
      minimize = FALSE, best = 0, worst = -Inf,
      properties = "regr",
      name = "Mean Poisson Log Likelihood",
      note = "For count data. Normalized to 0 for perfect fit.",
      fun = function(task, model, pred, feats, extra.args) {
        mean(dpois(pred$data$truth, pred$data$response, log = TRUE) -
             dpois(pred$data$truth, pred$data$truth, log = TRUE))
      })

    # example
    performance(pred, poisllmeasure)

This measure can be used for tuning by passing it to the measures parameter of tuneParams(). (Note that you will have to wrap it in a list: tuneParams(..., measures = list(poisllmeasure), ...).)
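A minimal tuning sketch, assuming the lrn, counttask, and poisllmeasure objects defined above are available; the tuned parameters and their ranges here are arbitrary illustrations, not recommendations:

```r
# Hypothetical search space over two common gbm parameters
# (n.trees and shrinkage are part of regr.gbm's parameter set):
ps <- makeParamSet(
  makeIntegerParam("n.trees", lower = 50, upper = 500),
  makeNumericParam("shrinkage", lower = 0.01, upper = 0.3))

res <- tuneParams(lrn, counttask,
  resampling = makeResampleDesc("CV", iters = 3),
  measures = list(poisllmeasure),   # note: wrapped in a list
  par.set = ps,
  control = makeTuneControlRandom(maxit = 10))

res$x  # best parameter setting found
```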
