4
$\begingroup$

To select the most appropriate type of data distribution, it is necessary to calculate the dispersion value. This can be overdispersion and underdispersion. To determine the dispersion, I used the DHARMa package.

First sample. When fitting the GLM model, I used offset, given that "density" = "nest" (number of nests) / "area". Dependent variable is the number of nests of species2.

An example of my data. I have several independent predictors, here is one of them.

structure(list(area = c(11.7, 14.2, 9.7, 12.5, 12.7, 7.4, 7.5, 10, 11.3, 7.1, 7.7, 14.1, 5.7, 11.1, 8.8, 9, 7.8, 8.5), nest_species1 = c(1L, 6L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 0L, 4L, 1L, 5L, 4L, 1L, 1L, 1L), nest_species2 = c(2L, 1L, 2L, 0L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 0L, 0L, 1L, 0L, 0L, 1L), green = c(5.7, 6.02, 6.14, 5.6, 5.64, 6.43, 6.2, 6.05, 4.91, 5.85, 4.86, 7.14, 5.22, 6.81, 6.63, 6.76, 6.42, 6.5)), class = "data.frame", row.names = c(NA, -18L )) 

I started by fitting a GLM using Poisson:

model.1=glm (nest_species2 ~ green + offset(log(area)), data= Example, family = poisson (link="log")) summary(model.1) 
Call: glm(formula = nest_species2 ~ green + offset(log(area)), family = poisson(link = "log"), data = Example) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.4435 2.1259 -0.679 0.497 green -0.1217 0.3512 -0.347 0.729 (Dispersion parameter for poisson family taken to be 1) Null deviance: 16.523 on 17 degrees of freedom Residual deviance: 16.404 on 16 degrees of freedom AIC: 50.699 Number of Fisher Scoring iterations: 5 

We see that the residual deviation is equal to the number of degrees of freedom. That is, there is no under- or overdispersion. Next, I determine the dispersion using the DHARMa package.

library(DHARMa) testDispersion(model.1) 
DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated data: simulationOutput dispersion = 0.6911, p-value = 0.472 alternative hypothesis: two.sided 

Judging by the value of 0.6911, there is underdispersion here. Given the underdispersion, I fit the GLM using double Poisson regression:

gamlss1 <- gamlss(nest_species2 ~ green + offset(log(area)), data= Example, family = DPO) summary(gamlss1) 
Family: c("DPO", "Double Poisson") Call: gamlss(formula = nest_species2 ~ green + offset(log(area)), family = DPO, data = Example) Fitting method: RS() ------------------------------------------------------------------ Mu link function: log Mu Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.4574 1.7915 -0.813 0.429 green -0.1200 0.2958 -0.406 0.691 ------------------------------------------------------------------ Sigma link function: log Sigma Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.3143 0.3324 -0.946 0.359 ------------------------------------------------------------------ No. of observations in the fit: 18 Degrees of Freedom for the fit: 3 Residual Deg. of Freedom: 15 at cycle: 3 Global Deviance: 45.97346 AIC: 51.97346 SBC: 54.64457 

Questions.

  1. Why is the variance estimate so different when using the ratio of residual deviation to residual degrees of freedom and when using DHARMa? Should I trust the DHARMa results more?
  2. Am I correct in interpreting the results using DHARMa to indicate that underdispersion was detected?
  3. Is it correct to use double Poisson regression in my case?

Second sample. Same input data. Similar algorithm for another bird species (species1). GLM fit using Poisson:

model.2=glm (nest_species1 ~ green + offset(log(area)), data= Example, family = poisson (link="log")) summary(model.2) 
Call: glm(formula = nest_species1 ~ green + offset(log(area)), family = poisson(link = "log"), data = Example) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.4714 1.7872 -3.061 0.0022 ** green 0.6269 0.2822 2.221 0.0263 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 18.122 on 17 degrees of freedom Residual deviance: 12.918 on 16 degrees of freedom AIC: 58.467 Number of Fisher Scoring iterations: 5 

We see that the residual deviation is less than the number of degrees of freedom. These are signs of underdispersion. Next, I determine the dispersion using the DHARMa package.

library(DHARMa) testDispersion(model.2) 
DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated data: simulationOutput dispersion = 0.90849, p-value = 0.976 alternative hypothesis: two.sided 

Judging by the value of 0.90849 there is underdispersion here. Given the insufficient dispersion, I fit the GLM using double Poisson regression:

gamlss2 <- gamlss(nest_species1 ~ green + offset(log(area)), data= Example, family = DPO) summary(gamlss2) 
Family: c("DPO", "Double Poisson") Call: gamlss(formula = nest_species1 ~ green + offset(log(area)), family = DPO, data = Example) Fitting method: RS() ------------------------------------------------------------------ Mu link function: log Mu Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.4952 1.4053 -3.910 0.00139 ** green 0.6293 0.2222 2.833 0.01260 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ------------------------------------------------------------------ Sigma link function: log Sigma Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.4612 0.3041 -1.516 0.15 ------------------------------------------------------------------ No. of observations in the fit: 18 Degrees of Freedom for the fit: 3 Residual Deg. of Freedom: 15 at cycle: 3 Global Deviance: 52.57692 AIC: 58.57692 SBC: 61.24803 

Questions.

  1. Am I correct in interpreting the results using DHARMa to indicate that underdispersion was detected?
  2. How should I correctly interpret the DHARMa results (> 1 indicates overdispersion, < 1 indicates underdispersion)?
  3. Is it correct to use double Poisson regression in my case?
$\endgroup$

1 Answer 1

6
$\begingroup$

Why is the variance estimate so different when using the ratio of residual deviation to residual degrees of freedom and when using DHARMa? Should I trust the DHARMa results more?

As I commented in a previous answer, (1) the estimate based on resid_deviance/resid_df is an approximate relationship that's especially unreliable for small sample sizes; (2) DHARMa uses a computational approach that will in general be more reliable.

Am I correct in interpreting the results using DHARMa to indicate that underdispersion was detected?

There will always be some amount of over- or underdispersion in a model fit — that is, the residual variation will never be exactly equal to the expected value from the model. DHARMa reports that the residuals are underdispersed (0.6911 < 1), but not significantly so (a two-sided $p$-value of 0.472); while there are some serious drawbacks to "two-stage testing" (i.e., using a statistical test to decide how to proceed with the next modeling step), and while we can only fail to reject the null hypothesis of equidispersion (i.e. this doesn't "prove that the data are not underdispersed" ...), I'd feel comfortable saying that there's no strong evidence of underdispersion here, so a Poisson model should be fine.

Is it correct to use double Poisson regression in my case?

If you want. I don't think it's wrong to use the DP, but ...

  • I spent a while going down rabbit holes trying to get a heuristic understanding of the generating process that leads to a double Poisson (for comparison, you can interpret a negative binomial distribution as being either [1] the result of a discrete-time survival model or [2] a Poisson with an extra level of Gamma-distributed heterogeneity in the rates), but couldn't find anything. Whenever possible I prefer to use a distribution whose generating process matches my expectations of the real-world process reasonably well ...
  • furthermore, there seem to be some computational concerns about DP (see You et al 2013) [I have no idea if the package you're using accounts for the problems described there].
  • I would be more inclined to use:
    • a Poisson response (since there is no strong evidence of under/overdispersion);
    • a quasi-Poisson model (since this is a simple and easy approach to handling under/overdispersion);
    • a generalized Poisson or COM-Poisson (because these are more widely used AFAICT than the DP, although my comment about understanding the generating process applies here too).

In general, with a data set this small (18 observations) I would lean heavily to doing a simpler analysis (e.g. Poisson or quasi-Poisson).

I think my answers apply to both parts of your question.


Efron, Bradley. 1986. “Double Exponential Families and Their Use in Generalized Linear Regression.” Journal of the American Statistical Association 81 (395): 709–21. https://doi.org/10.1080/01621459.1986.10478327.

Zou, Yaotian, Srinivas Reddy Geedipally, and Dominique Lord. 2013. “Evaluating the Double Poisson Generalized Linear Model.” Accident Analysis & Prevention 59 (October):497–505. https://doi.org/10.1016/j.aap.2013.07.017.

$\endgroup$
4
  • $\begingroup$ Ben Bolker, Thanks for the detailed explanations. I wanted to clarify one more thing. How to interpret the results of 'testDispersion', which of the obtained dispersion values should be attributed to underdispersion, and which to overdispersion? $\endgroup$ Commented Jul 17 at 21:47
  • 1
    $\begingroup$ I'm not sure what you mean. testDispersion does a two-sided test. If the computed dispersion value is <1 that's underdispersion, otherwise overdispersion ... (it's worth reading the help page for ?testDispersion, there's a lot of information there ... $\endgroup$ Commented Jul 18 at 0:37
  • $\begingroup$ thanks, you answered my question. $\endgroup$ Commented Jul 18 at 22:10
  • $\begingroup$ If this answers your questions, you're encouraged to click the check-mark to accept it ... $\endgroup$ Commented Jul 18 at 22:23

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.