Adrian
To answer your question "Is there a reason why this shouldn't be done?":

Are you familiar with the concept of linear dependence? The columns of your $X$ matrix must be linearly independent, otherwise there will be multiple coefficient vectors that produce the same fit.

An example:

    set.seed(123987)
    link <- function(x) exp(x) / (1 + exp(x))
    curve(link(x), -5, 5)  # Maps R to [0, 1]

    n <- 100
    df <- data.frame(x=runif(n, -0.5, 0.5))  # A continuous predictor, x
    df$f_1 <- factor(sample(letters[1:3], size=n, replace=TRUE), levels=letters[1:3])  # A factor
    colors <- c("green", "purple", "blue")
    df$f_2 <- factor(sample(colors, size=n, replace=TRUE), levels=colors)  # A second factor
    df$y <- 1 * (runif(n) < link(rnorm(n) + df$x +
                                 ifelse(df$f_1=="a", -1, ifelse(df$f_1=="b", 1, 2)) +
                                 ifelse(df$f_2=="green", -0.5, ifelse(df$f_2=="purple", 0, 5))))
    stopifnot(setequal(unique(df$y), c(0, 1)))

    fit <- glm(y ~ x + f_1 + f_2, data=df, family=binomial("logit"))
    coefficients(fit)  # Constant, x, f_1b, f_1c, f_2purple, f_2blue

    X <- matrix(1, nrow=n, ncol=length(fit$coefficients))  # Manually create the X matrix
    X[, 2] <- df$x
    ## No column for "a"
    X[, 3] <- 1*(df$f_1 == "b")
    X[, 4] <- 1*(df$f_1 == "c")
    ## No column for "green"
    X[, 5] <- 1*(df$f_2 == "purple")
    X[, 6] <- 1*(df$f_2 == "blue")
    colnames(X) <- c("constant", "x", "f_1b", "f_1c", "f_2purple", "f_2blue")
    Y <- matrix(df$y, ncol=1)
    colnames(Y) <- "y"

    fit2 <- glm(Y ~ 0 + X, family=binomial("logit"), data=list(Y, X))  # X already includes the constant
    all(coefficients(fit) == coefficients(fit2))  # TRUE

    # What happens if we drop the constant and put all levels of f_1 and f_2 in our matrix X?
    X <- matrix(NA, nrow=n, ncol=length(fit$coefficients) + 1)
    X[, 1] <- df$x
    X[, 2] <- 1*(df$f_1 == "a")
    X[, 3] <- 1*(df$f_1 == "b")
    X[, 4] <- 1*(df$f_1 == "c")
    X[, 5] <- 1*(df$f_2 == "green")
    X[, 6] <- 1*(df$f_2 == "purple")
    X[, 7] <- 1*(df$f_2 == "blue")
    colnames(X) <- c("x", "f_1a", "f_1b", "f_1c", "f_2green", "f_2purple", "f_2blue")

    ## The problem with this matrix is that the columns are linearly dependent
    X[, 2] + X[, 3] + X[, 4]  # Gives a vector of all 1s -- do you understand why?
    X[, 5] + X[, 6] + X[, 7]  # Gives a vector of all 1s, for the same reason
    zero_vector <- X[, 2] + X[, 3] + X[, 4] - (X[, 5] + X[, 6] + X[, 7])
    all(zero_vector == 0)  # TRUE

In the example above, I first generate some simple example data. I use glm to fit a logistic regression with a constant (and one omitted level for each factor). I then show you how to manually generate the X matrix for that model. Then I generate a new X, which includes all factor levels, and explicitly show you that its columns are linearly dependent.

If you have one factor, you can drop the constant in your model and estimate coefficients for all factor levels. (This produces the exact same fit either way; it's just the interpretation of the coefficients that changes -- in one case your coefficient is an average for that factor level, in the other it's the difference relative to the baseline, excluded level.)
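To see that equivalence concretely, here is a small sketch reusing the simulated df from above with a single factor (same fitted values, two parameterizations):

    # With a constant: reference level "a" is absorbed into the intercept.
    # Without a constant: one coefficient per level.
    fit_ref  <- glm(y ~ f_1,     data=df, family=binomial("logit"))
    fit_full <- glm(y ~ 0 + f_1, data=df, family=binomial("logit"))
    all.equal(fitted(fit_ref), fitted(fit_full))  # Identical fit
    coefficients(fit_ref)   # (Intercept) is level "a"; f_1b, f_1c are differences from "a"
    coefficients(fit_full)  # f_1a, f_1b, f_1c are the per-level values directly

Note that f_1a in the no-constant fit equals the intercept of the first fit, and f_1b (no constant) equals intercept + f_1b (reference coding): the same model, relabeled.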

But when you have two factors, it doesn't make sense to try and estimate coefficients for all levels of both factors: that will create linearly dependent columns in your X. You always have to drop one level from one factor (or two levels, one from each factor, if you include a constant).
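You can check that dependence numerically with the 7-column X built above. Because each block of dummy columns sums to a vector of 1s, you can shift coefficients between the two factor blocks without changing the fit (beta here is just an arbitrary illustrative coefficient vector):

    # Columns 2:4 and columns 5:7 each sum to 1 in every row, so adding a constant
    # to all f_1 coefficients and subtracting it from all f_2 coefficients
    # changes beta without changing X %*% beta:
    beta  <- rnorm(ncol(X))
    shift <- c(0, rep(1, 3), rep(-1, 3))  # 0 for x, +1 for f_1 levels, -1 for f_2 levels
    all.equal(c(X %*% beta), c(X %*% (beta + shift)))  # TRUE: different betas, same fit
    qr(X)$rank  # 6, not 7: the matrix is rank-deficient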

There is another aspect of your question which is about statistical significance. I think you slightly misunderstand the meaning of the coefficients in your model, and how the interpretation changes depending on whether or not you've included a constant.
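In particular, the z-tests reported by summary() test different hypotheses under the two parameterizations. A rough sketch, again reusing the simulated df:

    # With a constant: the test on f_1b asks "does level b differ from the baseline a?"
    summary(glm(y ~ f_1, data=df, family=binomial("logit")))$coefficients
    # Without a constant: the test on f_1b asks "is the level-b log-odds zero?",
    # i.e. "is the success probability for level b different from 0.5?"
    summary(glm(y ~ 0 + f_1, data=df, family=binomial("logit")))$coefficients

So a "significant" coefficient means something different in each model, even though the fits are identical.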

