$\begingroup$

Based on the answer here: Significance of categorical predictor in logistic regression, I tried adding "-1" to my model formula to fit it without an intercept and see the coefficients for all the levels directly.

It looks like adding the "-1" only helps for the first of the variables, and doesn't help when there is more than one categorical variable. I tried running it on "overweight ~ race + diet - 1" and then reversing the order of race and diet.

If race is first in the formula, all four race levels show up as significant.

```
glm(formula = overweight ~ race + diet - 1, family = "binomial", data = data)

Coefficients:
        Estimate Std. Error z value Pr(>|z|)    
race1  -1.17569    0.07916 -14.851  < 2e-16 ***
race2  -1.77863    0.08446 -21.058  < 2e-16 ***
race3  -1.85692    0.06967 -26.651  < 2e-16 ***
race4  -1.21037    0.07175 -16.869  < 2e-16 ***
diet2  -1.15341    0.09676 -11.921  < 2e-16 ***
diet3 -14.21256  315.57607  -0.045 0.964078    
diet4  -1.36219    0.08796 -15.486  < 2e-16 ***
diet5  -2.03216    0.58765  -3.458 0.000544 ***
diet6 -14.09964  186.44637  -0.076 0.939719    
```

When diet is first, race1 is not included in the model and race4's z value is not significant.

```
glm(formula = overweight ~ diet + race - 1, family = "binomial", data = data)

Coefficients:
        Estimate Std. Error z value Pr(>|z|)    
diet1   -1.17569    0.07916 -14.851  < 2e-16 ***
diet2   -2.32910    0.10598 -21.978  < 2e-16 ***
diet3  -15.38825  315.57607  -0.049    0.961    
diet4   -2.53788    0.09839 -25.794  < 2e-16 ***
diet5   -3.20785    0.59015  -5.436 5.46e-08 ***
diet6  -15.27533  186.44638  -0.082    0.935    
race2   -0.60294    0.10888  -5.538 3.06e-08 ***
race3   -0.68123    0.09790  -6.959 3.44e-12 ***
race4   -0.03468    0.09804  -0.354    0.724    
```

I also tried subtracting 1 from each of the categorical variables, but that didn't add diet1 to the model either.

```
glm(formula = overweight ~ race - 1 + diet - 1, family = "binomial", data = data)

Coefficients:
        Estimate Std. Error z value Pr(>|z|)    
race1  -1.17330    0.07915 -14.823  < 2e-16 ***
race2  -1.77969    0.08445 -21.073  < 2e-16 ***
race3  -1.85552    0.06968 -26.628  < 2e-16 ***
race4  -1.21214    0.07176 -16.892  < 2e-16 ***
diet2  -1.15544    0.09675 -11.943  < 2e-16 ***
diet3 -14.21292  315.57904  -0.045 0.964077    
diet4  -1.36182    0.08796 -15.482  < 2e-16 ***
diet5  -2.01937    0.58772  -3.436 0.000591 ***
diet6 -14.09991  186.44215  -0.076 0.939717    
```

Is there a way to fit multiple categorical variables while keeping all the "categories" in the model? Is there a reason why this shouldn't be done?

In this case I expect race4 to be statistically significant, but when race1 is used as the reference, race4 is not statistically significant. Is there a way to avoid this?

$\endgroup$

1 Answer

$\begingroup$

To answer your question "Is there a reason why this shouldn't be done?":

Are you familiar with the concept of linear dependence? The columns of your $X$ matrix must be linearly independent, otherwise there will be multiple coefficient vectors that produce the same fit.

An example:

```r
set.seed(123987)

link <- function(x) exp(x) / (1 + exp(x))
curve(link(x), -5, 5)  # Maps R to [0, 1]

n <- 100
df <- data.frame(x = runif(n, -0.5, 0.5))  # A continuous predictor, x
df$f_1 <- factor(sample(letters[1:3], size = n, replace = T), levels = letters[1:3])  # Factor
colors <- c("green", "purple", "blue")
df$f_2 <- factor(sample(colors, size = n, replace = T), levels = colors)  # A second factor
df$y <- 1 * (runif(n) < link(rnorm(n) + df$x +
    ifelse(df$f_1 == "a", -1, ifelse(df$f_1 == "b", 1, 2)) +
    ifelse(df$f_2 == "green", -0.5, ifelse(df$f_2 == "purple", 0, 5))))
stopifnot(setequal(unique(df$y), c(0, 1)))

fit <- glm(y ~ x + f_1 + f_2, data = df, family = binomial("logit"))
coefficients(fit)  # Constant, x, f_1b, f_1c, f_2purple, f_2blue

X <- matrix(1, nrow = n, ncol = length(fit$coefficients))  # Manually create X matrix
X[, 2] <- df$x
## No column for "a"
X[, 3] <- 1 * (df$f_1 == "b")
X[, 4] <- 1 * (df$f_1 == "c")
## No column for "green"
X[, 5] <- 1 * (df$f_2 == "purple")
X[, 6] <- 1 * (df$f_2 == "blue")
colnames(X) <- c("constant", "x", "f_1b", "f_1c", "f_2purple", "f_2blue")
Y <- matrix(df$y, ncol = 1)
colnames(Y) <- "y"
fit2 <- glm(Y ~ 0 + X, family = binomial("logit"))  # X already includes the constant
all(coefficients(fit) == coefficients(fit2))  # TRUE

# What happens if we drop the constant and put all levels of f_1 and f_2 in our matrix X?
X <- matrix(NA, nrow = n, ncol = length(fit$coefficients) + 1)
X[, 1] <- df$x
X[, 2] <- 1 * (df$f_1 == "a")
X[, 3] <- 1 * (df$f_1 == "b")
X[, 4] <- 1 * (df$f_1 == "c")
X[, 5] <- 1 * (df$f_2 == "green")
X[, 6] <- 1 * (df$f_2 == "purple")
X[, 7] <- 1 * (df$f_2 == "blue")
colnames(X) <- c("x", "f_1a", "f_1b", "f_1c", "f_2green", "f_2purple", "f_2blue")

## The problem with this matrix is that the columns are linearly dependent
X[, 2] + X[, 3] + X[, 4]  # Gives a vector of all 1s -- do you understand why?
X[, 5] + X[, 6] + X[, 7]  # Gives a vector of all 1s, for the same reason
zero_vector <- X[, 2] + X[, 3] + X[, 4] - (X[, 5] + X[, 6] + X[, 7])
all(zero_vector == 0)  # TRUE
```

In the example above, I first generate some simple example data. I use glm to fit a logistic regression with a constant (and one omitted level for each factor). I then show you how to manually generate the X matrix for that model. Then I generate a new X, which includes all factor levels, and explicitly show you that its columns are linearly dependent.

If you have one factor, you can drop the constant in your model and estimate coefficients for all factor levels. (This produces the exact same fit either way; it's just the interpretation of the coefficients that changes -- in one case your coefficient is an average for that factor level, in the other it's the difference relative to the baseline, excluded level.)
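As a minimal sketch of this point (a single made-up factor `f`, not your data), the two parameterizations give identical fitted values, and the no-intercept coefficients are just a reshuffling of the intercept-plus-contrasts ones:

```r
set.seed(42)
f <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
y <- rbinom(200, 1, ifelse(f == "a", 0.3, 0.6))

with_int    <- glm(y ~ f,     family = binomial)  # intercept + differences from "a"
without_int <- glm(y ~ f - 1, family = binomial)  # one coefficient (log-odds) per level

# Same model, different parameterization: identical fitted values
all.equal(fitted(with_int), fitted(without_int))  # TRUE

# The "-1" coefficient for level "a" equals the intercept of the other fit
coef(without_int)["fa"]  # same value as coef(with_int)["(Intercept)"]
```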

But when you have two factors, it doesn't make sense to try and estimate coefficients for all levels of both factors: that will create linearly dependent columns in your X. You always have to drop one level from one factor (or two levels, one from each factor, if you include a constant).
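You can also see the dependence directly from the rank of the design matrix. This small sketch uses two made-up factors (not your race/diet data) with full dummy coding for both and no constant:

```r
# Hypothetical toy factors; any two fully dummy-coded factors behave the same way
f <- factor(rep(c("a", "b", "c"), times = 4))
g <- factor(rep(c("u", "v"), times = 6))

# One column per level of each factor, no intercept column
X <- cbind(model.matrix(~ f - 1), model.matrix(~ g - 1))
ncol(X)      # 5 columns...
qr(X)$rank   # ...but rank 4, so one column is redundant

# fa + fb + fc and gu + gv are both the all-ones vector,
# so their difference is identically zero: a linear dependence
all(rowSums(X[, 1:3]) - rowSums(X[, 4:5]) == 0)  # TRUE
```

This is exactly why glm silently drops one level of the second factor in your "- 1" fits.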

There is another aspect of your question which is about statistical significance. I think you slightly misunderstand the meaning of the coefficients in your model, and how the interpretation changes depending on whether or not you've included a constant.
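With treatment coding, each factor coefficient's z-test compares that level to the reference level, so "race4 is not significant" only means race4 is indistinguishable from the reference, not that race4 has no effect. A sketch of how the reference level can be changed with relevel() -- this uses simulated data as a stand-in for your race variable, with race4 constructed to have the same true effect as race1:

```r
set.seed(1)
n    <- 2000
race <- factor(sample(paste0("race", 1:4), n, replace = TRUE))
eta  <- -1.2 + ifelse(race %in% c("race2", "race3"), -0.6, 0)  # race1 and race4 identical
y    <- rbinom(n, 1, plogis(eta))

# With race1 as reference, the race4 z-test asks "does race4 differ from race1?"
fit_ref1 <- glm(y ~ race, family = binomial)
summary(fit_ref1)$coefficients["race4", ]  # typically a large p-value here

# Changing the reference changes the question each test answers;
# the fitted model itself is unchanged
fit_ref2 <- glm(y ~ relevel(race, ref = "race2"), family = binomial)
all.equal(fitted(fit_ref1), fitted(fit_ref2))  # TRUE
```

So rather than trying to force all levels into the output, pick the reference whose comparisons you care about (or test the factor as a whole with a likelihood-ratio test via anova or drop1).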

$\endgroup$
  • $\begingroup$ An example would be helpful. Do you mean that 'race' and 'diet' need to be linearly independent? $\endgroup$ Commented May 28, 2015 at 7:23
