Interpreting Negative Binomial GLM results and model-fit

Question

The goal of the analysis was to test how much each predictor variable helps explain species richness, and specifically to address:

a) whether geodiversity is positively, consistently and significantly correlated with biodiversity (vascular plant richness);
b) how much the different components of geodiversity and the climate variables explain species richness (the response variable).

I aggregated biodiversity, geodiversity and climate covariates into 25 x 25 km grid cells and then used a generalized linear model (GLM) to test hypotheses (a) and (b). About my data: biodiversity (species richness) is a species count that is bounded at 0. All occurrence records were identified to species level and counted at each sample location (grid cell) in the Himalayas to give species richness per grid cell.

Variables:

- Patterns of plant species richness are strongly controlled by climate, topography, and soil conditions. Plant diversity generally increases with warmer temperatures. Topographical heterogeneity can also cause temperature to vary within a small area: a greater elevational range within a grid cell implies more environmental gradients (temperature, humidity, solar radiation), which support more habitats and species. I expect that environmental heterogeneity (a variety of climate, geology, soil, hydrology, and geomorphology) will offer different habitats that allow diverse plant species to coexist. Therefore, I expect the GLM to show that climatic variables have a strong, significant positive effect on species richness. I expect the same of topographic heterogeneity (elevational range) and of the geodiversity components, which reflect abiotic habitat complexity (more plant species can coexist where there is more habitat heterogeneity).

- The combined model will estimate how much species richness changes for every unit increase in each environmental predictor. The coefficients will quantify whether each variable has a significant positive or negative (and proportional) effect on species richness.

Steps: First I fit a multiple linear regression model and examined its residuals, which were not normally distributed. Therefore:

I decided to go with a GLM, as the response variable has a non-normal distribution. For a GLM the first step is to choose an appropriate distribution for the response variable; since species richness is count data, the options I considered were the Poisson, negative binomial and gamma distributions.
I chose the negative binomial distribution because the Poisson distribution assumes mean = variance, whereas in my data the variance is larger than the mean. I think this is due to outliers in the response variable (one sampled grid cell has a very high observed richness value).
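The mean = variance reasoning above can be checked directly. The following is a minimal sketch on simulated counts, not on the question's data: the predictor `x`, the sample size and the value `theta = 0.75` are all made up for illustration. It compares the raw mean and variance, computes the Pearson dispersion statistic of a Poisson fit (which, unlike the raw comparison, is conditional on the covariates), and compares AIC between the Poisson and negative binomial fits.

```r
# Hypothetical simulated counts; none of these numbers come from the question's data.
library(MASS)  # glm.nb() and rnegbin()

set.seed(1)
n  <- 500
x  <- rnorm(n)                           # one made-up predictor
mu <- exp(3 + 0.3 * x)                   # log link: mean on the count scale
y  <- rnegbin(n, mu = mu, theta = 0.75)  # overdispersed counts

# Raw check: under a Poisson model the variance should be close to the mean
print(c(mean = mean(y), variance = var(y)))

# Fit both candidate models
fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# Pearson dispersion statistic for the Poisson fit; values far above 1
# suggest overdispersion relative to the Poisson assumption
disp_pois <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(disp_pois)

# Lower AIC is preferred
print(AIC(fit_pois, fit_nb))
```

If the dispersion statistic is near 1 the Poisson model may be adequate; here the data were generated to be overdispersed, so the negative binomial fit should be clearly preferred.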

Confusion:

My understanding is very limited, so bear with me. From the model summary, I understand that Bio4, Mean_annual_rsds (solar radiation), ElevationalRange, and Hydrology are significant predictors of species richness, but I cannot make sense of why or how this is determined.

I also don't understand the direction of some effects. For example, why would hydrology (more complex hydrological features being present in the area) reduce richness? And why do the variables Bio1 (mean temperature) and Soil (soil types) not significantly predict species richness?

I'm also finding it hard to assess whether the model fits the data well. How can I answer that question by looking at, for example, a scatterplot of Pearson residuals vs predicted values? How can I assess whether this model fits the data well?

My results:

    glm.nb(formula = Species_richness ~ Bio1 + Bio4 + Bio15 + Bio18 + Bio19 +
        Mean_annual_rsds + ElevationalRange + Soil + Hydrology + Geology +
        Geomorphology_Geomorphons_25km__1_, data = mydata, link = "log",
        init.theta = 0.7437525773)

    Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
    (Intercept)                         4.670e+00  4.378e-01  10.667  < 2e-16 ***
    Bio1                                6.250e-03  4.039e-03   1.547 0.121796
    Bio4                               -1.606e-03  4.528e-04  -3.547 0.000389 ***
    Bio15                              -8.046e-04  2.276e-03  -0.353 0.723722
    Bio18                               1.506e-04  1.050e-04   1.434 0.151635
    Bio19                              -6.107e-04  3.853e-04  -1.585 0.112943
    Mean_annual_rsds                   -5.625e-02  1.796e-02  -3.132 0.001739 **
    ElevationalRange                    1.803e-04  3.762e-05   4.794 1.63e-06 ***
    Soil                               -6.318e-05  1.088e-04  -0.581 0.561326
    Hydrology                          -2.963e-03  8.085e-04  -3.664 0.000248 ***
    Geology                            -1.351e-02  2.466e-02  -0.548 0.583916
    Geomorphology_Geomorphons_25km__1_  1.435e-03  1.244e-03   1.153 0.248778
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for Negative Binomial(0.7438) family taken to be 1)

        Null deviance: 1482.0  on 1169  degrees of freedom
    Residual deviance: 1319.4  on 1158  degrees of freedom
    AIC: 8922.6

    Number of Fisher Scoring iterations: 1

                  Theta:  0.7438
              Std. Err.:  0.0287

     2 x log-likelihood:  -8896.5810
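On how significance is determined in a table like this: each z value is the estimate divided by its standard error, and Pr(>|z|) is the two-sided tail probability of that z under a standard normal distribution. The sketch below recomputes both for four rows copied from the table. Small discrepancies against the printed z values are expected because the printed estimates and standard errors are rounded. It also shows the usual reading of a log-link coefficient as a multiplicative change in expected richness. The 1000-unit step for ElevationalRange is an arbitrary illustration, since the units are not stated in the question.

```r
# Estimates and standard errors copied from the summary table
est <- c(Bio1 = 6.250e-03, Bio4 = -1.606e-03,
         ElevationalRange = 1.803e-04, Hydrology = -2.963e-03)
se  <- c(Bio1 = 4.039e-03, Bio4 = 4.528e-04,
         ElevationalRange = 3.762e-05, Hydrology = 8.085e-04)

z <- est / se             # Wald z statistic: estimate in units of its standard error
p <- 2 * pnorm(-abs(z))   # two-sided p-value from the standard normal
print(round(cbind(z, p), 4))

# Log link: exp(estimate) is the multiplicative change in expected richness
# for a one-unit increase in the predictor, holding the other predictors fixed
rate_ratio <- exp(est)
print(rate_ratio)

# One-unit changes are tiny for ElevationalRange, so look at a larger step
# (1000 units is an arbitrary choice; the units are an assumption)
rr_elev_1000 <- exp(1000 * est["ElevationalRange"])
print(rr_elev_1000)
```

A predictor is flagged as significant when its estimate is large relative to its standard error, so Bio1 (z about 1.5) is not flagged even though its estimate is larger in absolute size than that of Bio4 (z about -3.5).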

What’s needed are diagnostic plots to show (1) goodness of fit of the negative binomial distribution to your data, and (2) that shifts in the distribution due to covariates are properly modeled. (1) is analogous to q-q plots in ordinary linear models and (2) is analogous to parallelism of log-log survival curves in Cox proportional hazards models or parallelism of logit of cumulative distribution functions in proportional odds models. I don’t have experience with such plots for negative binomial but hope someone who reads this does.
– Frank Harrell, Commented Jul 20 at 13:06
performance::check_model(fitted_model) and library(DHARMa); plot(simulateResiduals(fitted_model)) are two standard, good ways to diagnose your model's quality.
– Ben Bolker, Commented Jul 20 at 15:36
You have 1170 separate 25 x 25 km grid cells? Presumably you took multiple samples per cell to get any sort of species diversity estimate per cell - that's a lot of hiking.
– Greg, Commented Jul 21 at 8:56
@Greg, I think this is a synthetic data set ... (maybe you know that and I'm missing the point ['whoosh'] ...)
– Ben Bolker, Commented Jul 21 at 20:18

Ben Bolker · Accepted Answer · 2025-07-23 14:00:52Z

There's a lot here — you're asking some very general questions about model workflow and interpretation — but I'll try to provide some helpful comments/answers.

I'm also finding it hard to assess whether the model fits the data well.

In one sense, whether the model fits the data well or not (e.g., a goodness-of-fit measure like an $R^2$ value, although these are a bit complicated for GLMs, see Wikipedia on pseudo-$R^2$ values) is beyond your control; that depends on how much useful information your covariates actually contain about the response.
In another sense, "fits the data well" could mean "are the assumptions of the model I'm using approximately valid?". This is definitely worth checking, and for it I would suggest graphical diagnostics such as check_model() from the performance package or the tools in the DHARMa package (both packages have vignettes that explain how to read the plots).
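Both senses of "fit" can be sketched in a few lines. The example below uses simulated data with made-up predictors `x1` and `x2`, not the question's data. The deviance-based quantity is only one of several pseudo-$R^2$ definitions, and for glm.nb the null deviance is computed with theta fixed at its estimate, so treat it as a rough summary. The DHARMa and performance calls are guarded so the script still runs if those packages are not installed; check_model() additionally needs the see package for plotting.

```r
library(MASS)  # glm.nb() and rnegbin()

# Hypothetical simulated data standing in for the real grid cells
set.seed(2)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rnegbin(n, mu = exp(3 + 0.4 * dat$x1), theta = 1)
fit <- glm.nb(y ~ x1 + x2, data = dat)

# Sense 1: how much variation the covariates explain
# (proportion of deviance explained, one pseudo-R^2 among several)
dev_explained <- 1 - fit$deviance / fit$null.deviance
print(dev_explained)

# Sense 2: are the model assumptions approximately valid?
if (requireNamespace("DHARMa", quietly = TRUE)) {
  sim <- DHARMa::simulateResiduals(fit)
  plot(sim)  # QQ plot of scaled residuals, and residuals vs. predicted values
  print(DHARMa::testDispersion(sim))
  print(DHARMa::testZeroInflation(sim))
}
if (requireNamespace("performance", quietly = TRUE) &&
    requireNamespace("see", quietly = TRUE)) {
  print(performance::check_model(fit))
}
```

Applied to the deviances printed in the question (1482.0 null, 1319.4 residual), the same formula gives roughly 0.11. That number says how informative the covariates are; it says nothing about whether the negative binomial assumptions hold, which is what the residual diagnostics address.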

how certain predictor variables such as hydrology; meaning more complex hydrological features being present in the area will reduce richness? And why do variables Bio1(mean temperature) and soil (soil types) not significantly predict species richness?

There are several related explanations for counterintuitive results such as "X affects richness in a surprising direction" or "X doesn't have significant effects on richness":

The estimates of effects of a covariate in a multiple regression are conditional on all of the other covariates; if the covariates are not all perfectly uncorrelated (orthogonal), then each estimate will be of the effect of $X$ with the values of all the other covariates held fixed, which can differ from the effect you would see when looking at $X$ alone. This can lead to (initially) surprising results. See Morrissey and Ruxton 2018 for more detail.
Ecology is complex, so we can almost always imagine some pathway through which X (complexity of hydrological features) could affect Y (species richness). For example, complex hydrological features → changes in fish community → changes in community of terrestrial predators eating fish → changes in allochthonous inputs of nutrients to the terrestrial system → change in the terrestrial plant community ... (If you really don't think hydrological features could affect diversity, why did you put that covariate in your model ...?)
An ecologically important effect might not be significant in your data set because there is lots of noise and not a lot of signal; for example, if there's very little variation in temperature in your data, it will be hard to see a significant effect.
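The first point in the list above (estimates are conditional on the other covariates) can be illustrated with a small simulation. This is a sketch under made-up assumptions: two generic predictors `x1` and `x2` with correlation 0.9, a true positive effect of `x1` and a true negative effect of `x2`. None of these values are taken from the question's data.

```r
library(MASS)  # glm.nb() and rnegbin()

set.seed(3)
n  <- 2000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # correlated with x1 (r about 0.9)

# True model: x1 raises the expected count, x2 lowers it
mu <- exp(3 + 0.8 * x1 - 0.3 * x2)
y  <- rnegbin(n, mu = mu, theta = 2)

# x2 on its own: its coefficient also absorbs the effect of the correlated x1
fit_marginal <- glm.nb(y ~ x2)

# x2 together with x1: its coefficient is conditional on x1
fit_joint <- glm.nb(y ~ x1 + x2)

print(coef(fit_marginal)["x2"])  # positive in this setup
print(coef(fit_joint)["x2"])     # negative, close to the true -0.3
```

The same variable shows a positive association when modelled alone and a negative one when the correlated covariate is included. Neither result is wrong; they answer different questions.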

Some more advice:

Your results may be easier to interpret if you scale and centre your predictor variables (see Schielzeth 2010).
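A minimal sketch of what scaling and centring changes, using simulated data with two hypothetical predictors (`elev_range` and `seasonality`) on very different numeric scales. After scaling, each coefficient is the change in log expected richness per one standard deviation of the predictor, so the magnitudes become comparable across predictors. The fit itself is unchanged.

```r
library(MASS)  # glm.nb() and rnegbin()

# Hypothetical predictors on very different scales
set.seed(4)
n   <- 500
dat <- data.frame(
  elev_range  = runif(n, 0, 5000),
  seasonality = rnorm(n, mean = 600, sd = 150)
)
dat$y <- rnegbin(
  n,
  mu    = exp(3 + 2e-4 * dat$elev_range - 1e-3 * dat$seasonality),
  theta = 1
)

# Raw predictors: coefficients are per original unit and hard to compare
fit_raw <- glm.nb(y ~ elev_range + seasonality, data = dat)

# Centred and scaled predictors: mean 0, standard deviation 1
dat_sc <- dat
dat_sc$elev_range  <- as.numeric(scale(dat$elev_range))
dat_sc$seasonality <- as.numeric(scale(dat$seasonality))
fit_sc <- glm.nb(y ~ elev_range + seasonality, data = dat_sc)

print(coef(fit_raw))
print(coef(fit_sc))

# Multiplicative change in expected count per one-SD increase
print(exp(coef(fit_sc)[-1]))
```

The scaled slope equals the raw slope multiplied by the predictor's standard deviation, and the z values and p-values for the slopes are the same in both fits; only the intercept and the units of the slopes change.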

Morrissey, Michael B., and Graeme D. Ruxton. 2018. “Multiple Regression Is Not Multiple Regressions: The Meaning of Multiple Regression and the Non-Problem of Collinearity.” Philosophy, Theory, and Practice in Biology 10. http://dx.doi.org/10.3998/ptpbio.16039257.0010.003.

Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients: Interpretation of Regression Coefficients.” Methods in Ecology and Evolution 1 (2): 103–13. https://doi.org/10.1111/j.2041-210X.2010.00012.x.

If this solved your problem you're encouraged to click the check-mark to accept it.
– Ben Bolker, Commented Jul 23 at 14:00
