$\begingroup$

I am running a Poisson GLM on insurance claim data. I use L1/L2 regularisation to account for potential lack of full credibility in my dataset. I also have an industry table, which (log of it) I use as an offset term.

In such a setting, should I penalise the intercept?

Earlier discussions typically advise against penalising the intercept - e.g. see here or here.

However, my understanding is that a fully regularised model, with its intercept penalised, would make predictions in line with the industry table. That is exactly what I expect: if a deviation is not strongly justified by my data, fall back on the industry table.

On the other hand, with the intercept unpenalised, the coefficients for the individual risk factors could be shrunk or eliminated, but the overall claim count implied by my data would still stay above the industry table - a deviation I do not expect to be justified.

Thoughts on this appreciated!

$\endgroup$
  • $\begingroup$ What do you mean that you use the industry table as an offset term? $\endgroup$ Commented Nov 3 at 11:38
  • $\begingroup$ Neither the intercept nor the offset are penalized by default. As the links explain, this is because you usually want a model with no other parameters to predict the grand mean -- otherwise additional sensitivities are introduced in parametrization (e.g. 'adding a constant $c$ to all $y$ would not simply result in a shift of all predictions by $c$'). I'm not sure that you're using this offset in its intended way though: the offset $\text{log}(\eta)$ makes your outcome be $y/\eta$, i.e. a rate rather than a count. This would suggest your data is not proportional to your reference... $\endgroup$ Commented Nov 3 at 12:52
  • $\begingroup$ Do you want to use the industry table as part of a prior distribution, kind of like “if the data don’t refute the established ideas, go with the established ideas”? $\endgroup$ Commented Nov 3 at 12:57
  • $\begingroup$ @Dave Yes, the industry table is a prior belief, as you said. This is also needed because each row represents a different number of insured lives. The model setup in R is: observed_claim_count ~ factors... + 1, offset=log(claim_count_implied_by_industry_table), family=poisson(), ... $\endgroup$ Commented Nov 3 at 14:57
  • $\begingroup$ @PBulls Using the offset in this way is standard practice in life insurance GLMs. We want a model that, after transforming back from log space, predicts the claim count as: claim_count_expected_by_industry * coeff_factor1 * coeff_factor2 * ... And yes, our goal is to model incidence rates on subgroups of the insured population. Since the datasets can be small and therefore not always credible, the 'grand mean' from the data is also in question. $\endgroup$ Commented Nov 3 at 15:03

1 Answer

$\begingroup$

Yes, in that setting it would make sense to penalise the intercept, to try to optimise the tradeoff between bias (from leaning too much on the industry table) and variance (from leaning on it too little).
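A minimal numerical sketch of that tradeoff (Python with NumPy; the `fit_intercept` helper, the Newton solver, the penalty weight `lam=50`, and the toy counts are all my own illustrative assumptions, not part of the question's actual model): an intercept-only Poisson fit with the industry-implied counts as offset, once unpenalised and once with a ridge penalty on the intercept.

```python
import numpy as np

def fit_intercept(y, industry_expected, lam=0.0, iters=100):
    """Ridge-penalised Poisson intercept with a log(industry) offset.

    Model: E[y_i] = industry_expected_i * exp(b), penalty lam * b**2.
    Newton's method on the convex penalised negative log-likelihood.
    """
    E, Y = np.sum(industry_expected), np.sum(y)
    b = 0.0
    for _ in range(iters):
        grad = E * np.exp(b) - Y + 2.0 * lam * b   # d/db of penalised NLL
        hess = E * np.exp(b) + 2.0 * lam
        b -= grad / hess
    return b

# Toy data (made up): the industry table implies 2 claims per cell,
# but the portfolio actually runs 50% above the table.
industry = np.full(50, 2.0)
y = np.full(50, 3.0)

b_free = fit_intercept(y, industry, lam=0.0)   # recovers log(sum y / sum industry)
b_pen = fit_intercept(y, industry, lam=50.0)   # shrunk toward 0, i.e. the industry level
print(np.exp(b_free), np.exp(b_pen))
```

With `lam = 0` the fitted level reproduces the portfolio's own experience (a factor of 1.5 over the table here); increasing `lam` pulls `exp(b)` back toward 1, i.e. toward the industry table, which is exactly the behaviour the questioner wants from a penalised intercept.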

It would probably be better to fit a Bayesian model and specify a prior on how much variation there is between datasets. If you had multiple models of this sort it would also help to fit them jointly, so the joint model has genuine information on the variation around the industry mean.

$\endgroup$
