R - multivariate glm, what to use as a p-value and odds ratio?

Question

This is for a case-control study. I need to get a p-value and an odds ratio with confidence intervals from my glm, but I'm unsure of the best approach. I have the glm set up as follows:

lroverall <- glm(diagnosis ~ variant + location, data = overall, family = binomial)

Diagnosis (case/control), variant (yes/no), and location (A,B,C) are all categorical variables taken from my 'overall' dataset.

summary(lroverall) gives the output:

Call:
glm(formula = diagnosis ~ variant + location, family = binomial, data = overall)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.42270  -0.73877   0.00005   0.00005   2.67713

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5603     0.1727   3.244 0.001178 **
variantyes   -1.2194     0.5367  -2.272 0.023095 *
locationA    -1.2050     0.2045  -5.892 3.82e-09 ***
locationB    -4.1156     1.0288  -4.000 6.32e-05 ***
locationC    -0.9249     0.2524  -3.664 0.000248 ***

For p-value, does it make sense to take the Pr(>|z|) for the variant (0.023)? Does this effectively measure association between diagnosis and variant while accounting for (removing?) effect of location? Or would I want to get a P-value for the overall model, or use a different test?

Similarly, is it appropriate to take the odds ratio for the variant (2.95e-01), calculated as below?

exp(cbind("Odds ratio" = coef(lroverall), confint.default(lroverall, level = 0.95)))

              Odds ratio        2.5 %       97.5 %
(Intercept) 1.751193e+00 1.248321e+00 2.456640e+00
variantyes  2.954030e-01 1.031654e-01 8.458547e-01
locationA   2.996777e-01 2.007040e-01 4.474587e-01
locationB   1.631541e-02 2.172174e-03 1.225467e-01
locationC   3.965552e-01 2.417924e-01 6.503760e-01

Plain coef(lroverall) will give you the log odds ratio, $\log \frac{O_{y|x=1}}{O_{y|x=0}}$. You need to use exp(coef(lroverall)) to get the actual odds ratio.
– Digio, Commented Feb 16, 2019 at 19:20
Hanaaa, are you sure that 'location' has only 3 levels? If so, then why are all three of them in the model? There should be only two of them in the model, as is the case with variantyes (you don't see variantno anywhere). You should run levels(overall$location) to see what's happening there.
– Digio, Commented Feb 17, 2019 at 19:35
I have exp() on the outside of cbind(), which applies to the coef(lroverall); did you mean I need an additional exp()? And sorry, you're correct about the levels: there were several more locations in my actual problem. I quickly removed a few here for brevity, but I should have removed one more from the output or indicated that.
– abana, Commented Feb 19, 2019 at 16:41
The exp() is fine, I was just asking about the levels because there's clearly more than 3. Are you OK with interpretation or do you still need an answer?
– Digio, Commented Feb 20, 2019 at 12:49

EdM · Accepted Answer · 2019-02-16 23:36:59Z

The issues of overall p-value calculations for glm() models are discussed on this page.

The p-values listed in the summary(glm()) reports are for differences of each individual coefficient from 0.

Interpreting these coefficients properly, however, can be difficult, and putting them together into odds ratios provides even more ways to go wrong. The default in R, which you have implicitly chosen by not specifying an alternative, is to use treatment contrasts.
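To see which level R is treating as the reference, you can inspect the factor directly. This is only a sketch: the `location` vector below is a made-up stand-in for overall$location, and "D" is a hypothetical name for the reference site that the question does not identify.

```r
# Hypothetical stand-in for overall$location; "D" plays the unnamed reference site
location <- factor(c("A", "B", "C", "D", "A", "D"))

levels(location)     # the first level is the reference under treatment contrasts
contrasts(location)  # the all-zero row marks the reference level

# Make "D" the reference, so a model would report locationA, locationB, locationC
location <- relevel(location, ref = "D")
levels(location)

# Confirm that treatment contrasts are the active default for unordered factors
getOption("contrasts")
```

Choosing the reference level with relevel() changes what the intercept and each location coefficient mean, but not the fitted probabilities.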

With treatment contrasts the intercept is the log-odds for a particular reference scenario, in your case for variant=no at whatever your reference location happens to be (something other than A, B or C). Exponentiating the intercept gives the odds (not an odds ratio) for that scenario, 1.751 as you have calculated it, and its confidence interval is OK as you calculated.

Each individual regression coefficient, however, then represents the difference associated with the predictor in question from that reference log-odds. So the log-odds for the case of variant=yes at your reference location is the sum of its coefficient with the intercept: $0.5603-1.2194=-0.6591$, for odds of 0.517. If you want the log-odds for variant=yes at location A, B, or C then you have to also add in that location's own coefficient.
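The arithmetic in this paragraph can be reproduced from the reported estimates. A minimal sketch, with the numbers copied from the summary() output in the question; the variable names are made up for illustration.

```r
# Coefficients copied from the summary() output in the question
b_intercept <- 0.5603
b_variant   <- -1.2194
b_locA      <- -1.2050

# variant = yes at the reference location: intercept + variant coefficient
log_odds_yes_ref <- b_intercept + b_variant
odds_yes_ref     <- exp(log_odds_yes_ref)

# variant = yes at location A: also add location A's own coefficient
log_odds_yes_A <- b_intercept + b_variant + b_locA
odds_yes_A     <- exp(log_odds_yes_A)

# Ratio of odds for variant = yes vs. no at the same location:
# the intercept and location terms cancel, leaving exp(variant coefficient)
or_variant <- exp(b_variant)

# plogis() turns a log-odds into a probability, if that is easier to read
p_yes_ref <- plogis(log_odds_yes_ref)

round(c(odds_yes_ref = odds_yes_ref, odds_yes_A = odds_yes_A,
        or_variant = or_variant, p_yes_ref = p_yes_ref), 3)
```

Because this model has no interaction term, the odds ratio for variant is the same at every location, which is the value the question computed with exp(coef(lroverall)).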

Calculating the confidence intervals for specific log-odds or odds ratios has to use the information from the covariance matrix of the coefficients. You can't just use the individual standard errors (which are the square roots of the diagonal of that matrix) as there are typically covariances among the coefficient values (off-diagonal elements). Use vcov(lroverall) to get that covariance matrix. Then you need to use the formula for the variance of a sum of correlated variables to get the confidence intervals for specific cases. The rms package in R has facilities to simplify such calculations, but some find there to be a pretty steep initial learning curve for that package.
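As a sketch of the variance-of-a-sum calculation, the code below builds a Wald confidence interval for the log-odds at variant=yes and the reference location. The data are simulated because the original 'overall' data set is not available: the sample size, the effect sizes, and the reference level "D" are all assumptions made for illustration.

```r
set.seed(1)
n <- 400

# Simulated stand-in for the 'overall' data (all values hypothetical)
overall <- data.frame(
  variant  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  location = relevel(factor(sample(c("A", "B", "D"), n, replace = TRUE)),
                     ref = "D")
)
eta <- 0.5 - 1.0 * (overall$variant == "yes") - 0.8 * (overall$location == "A")
overall$diagnosis <- rbinom(n, 1, plogis(eta))

lroverall <- glm(diagnosis ~ variant + location, data = overall,
                 family = binomial)

b <- coef(lroverall)
V <- vcov(lroverall)   # covariance matrix of the coefficient estimates

# Log-odds for variant = yes at the reference location
terms <- c("(Intercept)", "variantyes")
est <- sum(b[terms])

# Var(a + b) = Var(a) + Var(b) + 2 * Cov(a, b): the sum of the 2 x 2 sub-block
se <- sqrt(sum(V[terms, terms]))

ci_log_odds <- est + c(-1, 1) * qnorm(0.975) * se
ci_odds     <- exp(ci_log_odds)   # back-transform to the odds scale
```

The same estimate and standard error should come out of predict(lroverall, newdata, type = "link", se.fit = TRUE) for a one-row newdata describing that scenario, which avoids doing the matrix bookkeeping by hand.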

Thank you for the informative answer. I’m still a bit uncertain about the p-value here. I suppose the p-value for overall fit isn't what I was asked to find. For the p-values reported by the glm: I can understand that the p-values are reported for the difference of a coefficient from 0, but in terms of my real-world application, would it be correct to say the glm p-value for variant=yes is reported for the association of the variant with disease, after removing/‘controlling for’ the effect of location? Or would I need to do an additional test?
– abana, Commented Feb 19, 2019 at 17:21
@hanaaa there is a question whether you have adequately "controlled for" the effect of location with your particular model. The p-value reported for variant=yes assumes that the influence of variant on log-odds is independent of location. If that assumption is correct then you are correct. With such large baseline differences among locations, however, I would worry a lot about that assumption. You might need to consider a model with variant/location interactions.
– EdM, Commented Feb 19, 2019 at 18:59
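One way to probe the no-interaction assumption raised in this comment is to fit the interaction model and compare it with the additive one by a likelihood-ratio test. This is a sketch on simulated data, not the original study data: the sample size, the effect sizes, and the location levels are hypothetical.

```r
set.seed(2)
n <- 600

# Simulated stand-in for the 'overall' data (all values hypothetical)
overall <- data.frame(
  variant  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  location = relevel(factor(sample(c("A", "B", "D"), n, replace = TRUE)),
                     ref = "D")
)
eta <- 0.5 - 1.0 * (overall$variant == "yes") - 0.8 * (overall$location == "A")
overall$diagnosis <- rbinom(n, 1, plogis(eta))

additive <- glm(diagnosis ~ variant + location, data = overall,
                family = binomial)
with_interaction <- glm(diagnosis ~ variant * location, data = overall,
                        family = binomial)

# Likelihood-ratio test: does the variant effect differ by location?
lrt <- anova(additive, with_interaction, test = "Chisq")
lrt

# Likelihood-ratio test for each term of the additive model,
# an alternative to the Wald z-test printed by summary()
drop1(additive, test = "Chisq")
```

If the interaction terms turn out to matter, a single odds ratio for variant no longer summarizes the association, and location-specific odds ratios would have to be reported instead.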
