I am fairly new to more complex statistics and I'm trying to get my head round appropriate variable selection methods including Lasso shrinkage, so would really appreciate any help and guidance offered.
As a bit of background on my data set: I will be running a GLMM on use vs availability GPS data (41,636 total obs) for perching locations of birds (ID = random effect). I have 10 predictor variables, chosen based on previous research, that are likely to influence perch site selection for that species. However, no previous research has been carried out in my study area, so it is likely that not all of these variables influence perch site selection for my birds.
So far I have used the glmnet package to run lasso shrinkage to determine which predictors should be kept. Here is the code I used (this example is from a slightly different but similar dataset):
    library(glmnet)

    # Convert predictors to a design matrix (drop the intercept column)
    x <- model.matrix(Use ~ Habitat + Season + A + B + C + D + E + F + G, data = df)[, -1]
    y <- df$Use  # must be numeric (0/1)

    # Fit LASSO
    Lasso_fit <- glmnet(x, y, alpha = 1, family = "binomial", standardize = TRUE)

    # Plot coefficient paths
    plot(Lasso_fit, xvar = "lambda", label = TRUE)
    plot(Lasso_fit, xvar = "dev", label = TRUE)

    # Cross-validation to find optimal lambda
    set.seed(123)
    cv_lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "class")
    plot(cv_lasso)
    best_lambda <- cv_lasso$lambda.min

    # Extract coefficients at best lambda
    coef(cv_lasso, s = "lambda.min")

The results were:
    > coef(cv_lasso, s = "lambda.min")
    15 x 1 sparse Matrix of class "dgCMatrix"
                            s0
    (Intercept)  -4.3742763020
    habB          0.3448104589
    habC         -0.0012576147
    habD          0.5056425623
    habE         -0.1176552784
    SeasonSpring -0.0660149736
    SeasonSummer  .
    SeasonWinter  .
    A             1.3288390486
    B             0.4343750355
    C            -0.0302584569
    D             0.3157774736
    E             .
    F            -0.0003280062
    G             0.0563677638

I am aware that glmnet ignores my random effects and that other packages such as glmmLasso may be more appropriate; however, these are too computationally intensive for my large dataset. I have therefore decided to compare a null model, a full model (with all variables), and a lasso-shrunk model (removing variables E and F) using AIC, and to choose the best-fitting model on that basis.
           df      AIC
    Null    2 57723.75
    Full   17 34059.04
    Shrunk 15 36452.12

However, I have noticed that the full model is nearly always the best fitting model based on AIC and I don’t fully understand why this may be. After variable/model selection I am then using ggpredict on the final GLMM to summarise the predicted probability of use.
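A minimal sketch of this kind of three-model comparison, assuming the GLMMs are fitted with `glmer` from the lme4 package (an assumption; the GLMM package is not named above). The data are simulated and the names `sim`, `A`, `B`, `E`, `m_null`, `m_full` and `m_shrunk` are placeholders, not the real data set or models:

```r
library(lme4)

set.seed(1)

# Simulated stand-in data: 20 birds with 150 points each (all names hypothetical)
n_birds <- 20
n_per   <- 150
sim <- data.frame(
  ID = factor(rep(seq_len(n_birds), each = n_per)),
  A  = rnorm(n_birds * n_per),
  B  = rnorm(n_birds * n_per),
  E  = rnorm(n_birds * n_per)  # pure noise: no true effect on Use
)

# Bird-level random intercepts plus true effects of A and B only
bird_eff <- rnorm(n_birds, sd = 0.5)
eta <- -1 + 1.3 * sim$A + 0.4 * sim$B + bird_eff[as.integer(sim$ID)]
sim$Use <- rbinom(nrow(sim), size = 1, prob = plogis(eta))

# Null, full, and reduced ("shrunk") GLMMs, all with the same random effect
m_null   <- glmer(Use ~ 1 + (1 | ID),         data = sim, family = binomial)
m_full   <- glmer(Use ~ A + B + E + (1 | ID), data = sim, family = binomial)
m_shrunk <- glmer(Use ~ A + B + (1 | ID),     data = sim, family = binomial)

# One row per model, with columns df and AIC
aic_tab <- AIC(m_null, m_full, m_shrunk)
print(aic_tab)
```

The three models share the same response, data, and random-effects structure, so their AIC values are directly comparable. A fitted model such as `m_full` could then be passed to something like `ggeffects::ggpredict(m_full, terms = "A")` to summarise predicted probability of use, again assuming the ggeffects package.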
I guess I am therefore doing more "exploratory" analysis to see what variables influence my study population and I want to complete variable selection to find the most important variables influencing perch site selection (while avoiding any 'data mining').
I was hoping someone could provide some guidance on whether my method is appropriate and, if not, on the best way to go about variable selection. I have spent time looking at the published literature but, not coming from a maths/stats background, have struggled to fully understand the different methods.