How to use LASSO shrinkage methods using glmnet for a GLMM model

I am fairly new to more complex statistics and I am trying to get my head around appropriate variable selection methods, including lasso shrinkage, so I would really appreciate any help and guidance offered.

As a bit of background on my data set: I will be running a GLMM on use-versus-availability GPS data (41,636 observations in total) for perching locations of birds (bird ID as the random effect). I have 10 predictor variables, chosen on the basis of previous research, which are likely to influence perch site selection for this species. However, no previous research has been carried out in my study area, so it is likely that not all of these variables influence perch site selection for my birds.

So far I have used the glmnet package to run lasso shrinkage to determine which predictors should be kept. Here is the code I used (an example from a slightly different, but similar, dataset):

library(glmnet)

# Convert predictors to a numeric model matrix (drop the intercept column)
x <- model.matrix(Use ~ Habitat + Season + A + B + C + D + E + F + G, data = df)[, -1]
y <- df$Use  # must be numeric (0/1)

# Fit LASSO
Lasso_fit <- glmnet(x, y, alpha = 1, family = "binomial", standardize = TRUE)

# Plot coefficient paths
plot(Lasso_fit, xvar = "lambda", label = TRUE)
plot(Lasso_fit, xvar = "dev", label = TRUE)

# Cross-validation to find the optimal lambda
set.seed(123)
cv_lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "class")
plot(cv_lasso)
best_lambda <- cv_lasso$lambda.min

# Extract coefficients at the best lambda
coef(cv_lasso, s = "lambda.min")

The results were:

> coef(cv_lasso, s = "lambda.min")
15 x 1 sparse Matrix of class "dgCMatrix"
                        s0
(Intercept)  -4.3742763020
habB          0.3448104589
habC         -0.0012576147
habD          0.5056425623
habE         -0.1176552784
SeasonSpring -0.0660149736
SeasonSummer  .
SeasonWinter  .
A             1.3288390486
B             0.4343750355
C            -0.0302584569
D             0.3157774736
E             .
F            -0.0003280062
G             0.0563677638
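(In this output a "." marks a coefficient shrunk exactly to zero.) The dropped and retained predictors can also be listed programmatically; a small sketch using the cv_lasso object from the code above:

```r
# Pull the coefficients at lambda.min as a dense matrix
cf <- as.matrix(coef(cv_lasso, s = "lambda.min"))  # 15 x 1 matrix

# Predictors shrunk exactly to zero vs. those that survive
dropped <- rownames(cf)[cf[, 1] == 0]
kept    <- setdiff(rownames(cf), c("(Intercept)", dropped))
dropped
kept
```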

I am aware that glmnet ignores the influence of my random effects, and that other packages such as glmmLASSO may be more appropriate; however, with my large dataset they are too computationally intensive. I have therefore decided to compare a null model, a full model (with all variables), and a lasso-shrunk model (with variables E and F removed) using AIC, and to choose the best-fitting model on that basis.

        df      AIC
Null     2 57723.75
Full    17 34059.04
Shrunk  15 36452.12
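For reference, the comparison above was set up roughly as follows (a sketch assuming lme4::glmer with a random intercept per bird ID; the formulas are reconstructed from the variables named above):

```r
library(lme4)

# Null, full, and lasso-shrunk binomial GLMMs with a random intercept per bird
null_fit   <- glmer(Use ~ 1 + (1 | ID), data = df, family = binomial)
full_fit   <- glmer(Use ~ Habitat + Season + A + B + C + D + E + F + G + (1 | ID),
                    data = df, family = binomial)
shrunk_fit <- glmer(Use ~ Habitat + Season + A + B + C + D + G + (1 | ID),
                    data = df, family = binomial)

# Compare the three fits by AIC
AIC(null_fit, full_fit, shrunk_fit)
```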

However, I have noticed that the full model is nearly always the best-fitting model by AIC, and I don't fully understand why this may be. After variable/model selection I then use ggpredict on the final GLMM to summarise the predicted probability of use.
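The prediction step looks roughly like this (a sketch using ggpredict from the ggeffects package; final_model and the focal term "Habitat" are placeholder choices, not my exact call):

```r
library(ggeffects)

# final_model: the GLMM chosen after the AIC comparison (hypothetical name)
# Predicted probability of use across habitat types
pred <- ggpredict(final_model, terms = "Habitat")
plot(pred)
```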

I guess I am therefore doing a more "exploratory" analysis to see which variables influence my study population, and I want to complete variable selection to find the most important variables influencing perch site selection (while avoiding any 'data mining').

I was hoping someone could provide some guidance on whether my method is appropriate and, if not, on the best way to go about variable selection. I have spent time looking at the published literature, but, not coming from a maths/stats background, I have struggled to fully understand the different methods.

