I have been dabbling in NB regression for less than a year now. I have applied the well-known goodness-of-fit (g.o.f.) tests. Lately I started using the Conditional Moment (CM) test described in Cameron and Trivedi's book Regression Analysis of Count Data (Ch. 5, p. 194). It basically compares the expected and observed counts and tests the g.o.f. with a chi-square statistic.
Here is my problem: I generate synthetic NB2 data where the linear predictor is a function of two covariates. Below is the R code, borrowed from Hilbe's [NB Book]:
```r
nb2 <- function(nobs = 5000, off = 0, xv = c(-16.50, 1.65, 0.75)) {
  x2    <- rnorm(nobs, mean = 10000, sd = 2100)
  x3    <- 0.1 + runif(nobs)
  X     <- cbind(1, log(x2), x3)
  xb    <- X %*% xv
  alpha <- 6.50
  exb   <- exp(xb + off)                                # Poisson predicted value
  xg    <- rgamma(n = nobs, shape = alpha, rate = alpha) # gamma variates given alpha
  xbg   <- exb * xg                                     # mix Poisson and gamma variates
  nby   <- rpois(nobs, xbg)                             # generate NB2 variates
  out   <- data.frame(y = nby, x2 = x2, x3 = x3)
  return(out)
}
```

I generate 5000 random NB2 observations:
```r
data <- nb2(nobs = 5000)
```

The summary table of the generated data is:
Then I fit an NB2 model to the data with the glm.nb function (package MASS), using the two covariates as follows:
```r
m1 <- glm.nb(y ~ log(x2) + x3, data = data)
```

The results are:
```
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.4137 -0.9121 -0.7612  0.5488  3.3634

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -15.20781    1.03408 -14.707   <2e-16 ***
log(x2)       1.50963    0.11167  13.519   <2e-16 ***
x3            0.71873    0.07901   9.097   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(4.7311) family taken to be 1)

    Null deviance: 4714.7 on 4999 degrees of freedom
Residual deviance: 4433.2 on 4997 degrees of freedom
AIC: 8437.5

Number of Fisher Scoring iterations: 1

              Theta: 4.73
          Std. Err.: 1.16
 2 x log-likelihood: -8429.509
```

When I calculate the Pearson statistic, I get 1.0063, which is as expected. I also apply the CM test with 11 bins (the last bin collecting counts of 10 and more) to check how well the expected and observed counts match. The resulting chi-square statistic is 5.77, which is not significant for df = 10, i.e. the fit is good. This can also be verified visually using rootograms of the model, where the expected and observed counts overlap almost perfectly.
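For concreteness, here is a minimal sketch of the expected-versus-observed comparison underlying the CM test, assuming a fitted glm.nb object. The function name `cm_counts` and the plain chi-square at the end are mine; Cameron and Trivedi's full CM statistic additionally corrects for the fact that the parameters are estimated, so this is only an approximation of their procedure:

```r
## Sketch: binned expected vs. observed counts for a glm.nb fit.
## NOTE: the plain chi-square computed here omits the correction for
## estimated parameters that the full CM test applies.
library(MASS)

cm_counts <- function(fit, y, max_count = 10) {
  mu    <- fitted(fit)   # per-observation NB2 means
  theta <- fit$theta     # estimated dispersion (size) parameter
  bins  <- 0:(max_count - 1)
  ## expected cell count for k: sum over observations of P(Y_i = k)
  expd  <- sapply(bins, function(k) sum(dnbinom(k, mu = mu, size = theta)))
  expd  <- c(expd, length(y) - sum(expd))  # last bin: counts >= max_count
  obs   <- c(sapply(bins, function(k) sum(y == k)), sum(y >= max_count))
  list(observed = obs, expected = expd,
       chisq = sum((obs - expd)^2 / expd))
}

## usage with the simulated data above:
## cc <- cm_counts(m1, data$y); cc$chisq
```

The Pearson statistic quoted above is simply `sum(residuals(m1, type = "pearson")^2) / df.residual(m1)`.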
Then I try an intercept-only model; the results are:
```
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.8938 -0.8938 -0.8938  0.6647  3.2406

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.83933    0.02325  -36.11   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.5845) family taken to be 1)

    Null deviance: 4419.8 on 4999 degrees of freedom
Residual deviance: 4419.8 on 4999 degrees of freedom
AIC: 8705.4

Number of Fisher Scoring iterations: 1

              Theta: 2.584
          Std. Err.: 0.410
 2 x log-likelihood: -8701.442
```

Here the Pearson statistic is 1.0060, and the chi-square statistic for the CM test is 3.897 (which is again not significant). The rootogram is:
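For reference, the rootograms above can be drawn without extra packages. The following hanging rootogram is my own hand-rolled sketch (the countreg package on R-Forge provides a polished rootogram() function); it takes the observed and expected counts per bin:

```r
## Hanging rootogram sketch: bars of height sqrt(observed) hang from the
## sqrt(expected) curve; bars dipping below the zero line mark counts that
## the model under-predicts.
hanging_rootogram <- function(obs, expd) {
  counts <- seq_along(obs) - 1  # bins 0, 1, 2, ...
  se <- sqrt(expd)
  so <- sqrt(obs)
  plot(counts, se, type = "b", col = "red", xlab = "count",
       ylab = "sqrt(frequency)", ylim = range(0, se, se - so))
  rect(counts - 0.4, se - so, counts + 0.4, se,
       col = "grey", border = "grey40")
  abline(h = 0, lty = 2)
  invisible(data.frame(count = counts, bottom = se - so))
}
```

Called with the observed and expected vectors from the binned comparison, a well-fitting model shows all bars ending close to the zero line.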
1- I did the same with one covariate only; the results are similar, and I do not see the effect of removing a significant covariate.
2- I tried using more than two covariates, up to six; still the same result.
3- I repeated this with many different random sets and varying sample sizes and get similar results: I do not see the effect of removing the covariates, and it looks like an intercept-only model, or one with a single covariate, does the job (of course, the log-likelihood values improve considerably when covariates are added).
4- On the other hand, when I repeat this experiment with synthetic Poisson data, fit the model using Poisson regression, and carry out the same approach, the impact of removing a covariate is clearly visible.
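For point 4, the Poisson counterpart of the generator is just the code above without the gamma mixing; this is a sketch along those lines (the function name `pois2` is mine):

```r
## Poisson analogue of nb2(): same linear predictor, no gamma heterogeneity,
## so the conditional variance equals the conditional mean.
pois2 <- function(nobs = 5000, off = 0, xv = c(-16.50, 1.65, 0.75)) {
  x2 <- rnorm(nobs, mean = 10000, sd = 2100)
  x3 <- 0.1 + runif(nobs)
  xb <- cbind(1, log(x2), x3) %*% xv
  data.frame(y = rpois(nobs, exp(xb + off)), x2 = x2, x3 = x3)
}

## fit with, e.g.:
## m0 <- glm(y ~ log(x2) + x3, family = poisson, data = pois2())
```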
Does anyone have an explanation for this? I would appreciate any input, comments, or suggestions.


