2
$\begingroup$

To select the most appropriate type of data distribution, it is necessary to calculate the dispersion value. This can be overdispersion and underdispersion. To determine the dispersion, I used the DHARMa package. When fitting the GLM model, I used offset, given that "density" = "nest" (number of nests) / "area". Dependent variable is the number of nests of species.

An example of my data. I have several independent predictors, here is one of them.

structure(list(area = c(11.7, 14.2, 9.7, 12.5, 12.7, 7.4, 7.5, 10, 11.3, 7.1, 7.7, 14.1, 5.7, 11.1, 8.8, 9, 7.8, 8.5), nest_species = c(1L, 6L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 0L, 4L, 1L, 5L, 4L, 1L, 1L, 1L), green = c(5.7, 6.02, 6.14, 5.6, 5.64, 6.43, 6.2, 6.05, 4.91, 5.85, 4.86, 7.14, 5.22, 6.81, 6.63, 6.76, 6.42, 6.5)), class = "data.frame", row.names = c(NA, -18L)) 

I started by fitting a GLM using Poisson:

model=glm (nest_species ~ green + offset(log(area)), data= Example, family = poisson (link="log")) summary(model) 
Call: glm(formula = nest_species ~ green + offset(log(area)), family = poisson(link = "log"), data = Example) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.4714 1.7872 -3.061 0.0022 ** green 0.6269 0.2822 2.221 0.0263 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 18.122 on 17 degrees of freedom Residual deviance: 12.918 on 16 degrees of freedom AIC: 58.467 Number of Fisher Scoring iterations: 5 

I determine the dispersion using the DHARMa package.

library(DHARMa) testDispersion(model) 
DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated data: simulationOutput dispersion = 0.90849, p-value = 0.976 alternative hypothesis: two.sided 

Judging by the value of 0.90849 there is probably insufficient dispersion here. Given the insufficient dispersion, I fit the GLM using Conway-Maxwell-Poisson regressions:

model_COM <-glmmTMB(nest_species ~ green + offset(log(area)), data= Example, family="compois") summary(model_COM) Family: compois ( log ) Formula: nest_species ~ green + offset(log(area)) Data: Example AIC BIC logLik -2*log(L) df.resid 59.6 62.3 -26.8 53.6 15 Dispersion parameter for compois family (): 0.643 Conditional model: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.3754 1.5075 -3.566 0.000363 *** green 0.6125 0.2374 2.579 0.009896 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Note that the p-value here is significantly lower than in the Poisson regression. It should be noted that I switched to glmmTMB to implement Conway-Maxwell-Poisson regression. In the syntax, I did not use the random effect that is used in classical GLMMs.

Question: Have I specified the correct syntax for applying Conway-Maxwell-Poisson regression in my case?

$\endgroup$
2
  • $\begingroup$ 1. $p = 0.976$ means there is no evidence of over/underdispersion. 2. Why do you think area should be an offset, rather than an explanatory variable in the model? $\endgroup$ Commented Jul 19 at 5:33
  • $\begingroup$ @Frans Rodenburg, The explanatory variable is the number of nests (population density) of one bird species. $\endgroup$ Commented Jul 19 at 8:31

1 Answer 1

5
$\begingroup$

I'm going to answer the specific question here and then make some (unsolicited) advice.

The syntax is correct as far as I can tell.

However, I think you may not be following the best modeling practice. As I have commented in a previous answer in this series:

  • a dispersion of 0.908, while technically underdispersed, is not very underdispersed (10% below the expected value);
  • furthermore, the $p$-value (0.976) is non-significant by any reasonable critical value cut-off, and
  • the AIC of the COM-Poisson model that accounts for underdispersion is higher (worse) than the simpler Poisson model (59.6 > 58.5) [if we accounted for the finite size of the sample using a criterion like the corrected AIC (AICc), the more complex model would appear even worse]

You say

To select the most appropriate type of data distribution, it is necessary to calculate the dispersion value.

This is technically true, but it is only one small aspect of the modeling process. My normal modeling procedure would be to

  • pick a candidate model, based on the
    • response type (continuous, proportion, count etc);
    • size and other characteristics of the data set (which controls how complex a model I think I should try to fit — e.g. should I include zero-inflation, or random effects, or interaction terms, or splines/smooth terms to account for nonlinearity, or ... ??);
    • the goal of the model (exploration, prediction, inference).
  • look at graphical diagnostics in a holistic way, considering many different ways in which the model could be wrong (nonlinearity, outliers, failure of distributional assumptions, heteroscedasticity, ...). DHARMa and performance::check_model() are my go-to tools these days.

The graphical output from these packages (shown below) isn't perfect, but doesn't show any problems that I think I can fix in this case — for example, this data set is too small to feasibly fit a model that tries to account for heteroscedasticity.

Finally, you mention that

the p-value [of the coefficients] here [i.e., for the COM-Poisson fit] is significantly lower than in the Poisson regression

In my opinion, you should never consider the significance of regression coefficients when choosing which model to use; that way lies data snooping.


DHARMa plot showing no particular problems

enter image description here

$\endgroup$
2
  • $\begingroup$ Thanks for the comments, they are very helpful. $\endgroup$ Commented Jul 20 at 21:36
  • 1
    $\begingroup$ If this solved your problem you are encouraged to click the check-mark to accept the answer. $\endgroup$ Commented Jul 20 at 21:57

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.