
I am fairly new to more complex statistics and am trying to get my head around appropriate variable selection methods, including lasso shrinkage, so I would really appreciate any help and guidance.

As a bit of background on my data set: I will be running a GLMM on use vs. availability GPS data (41,636 total obs) for perching locations of birds (ID = random effect). I have 10 predictor variables, chosen based on previous research, that are likely to influence perch site selection for this species. However, previous research has not been conducted in my study area, so it is likely that not all of these variables influence perch site selection for my birds.

So far I have used the glmnet package to run lasso shrinkage to determine which predictors should be kept. Here is the code used (an example from a slightly different but similar dataset):

    library(glmnet)

    # Convert predictors to a model matrix (drop the intercept column)
    x <- model.matrix(Use ~ Habitat + Season + A + B + C + D + E + F + G,
                      data = df)[, -1]
    y <- df$Use  # must be numeric (0/1)

    # Fit the lasso
    Lasso_fit <- glmnet(x, y, alpha = 1, family = "binomial", standardize = TRUE)

    # Plot coefficient paths
    plot(Lasso_fit, xvar = "lambda", label = TRUE)
    plot(Lasso_fit, xvar = "dev", label = TRUE)

    # Cross-validation to find the optimal lambda
    set.seed(123)
    cv_lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial",
                          type.measure = "class")
    plot(cv_lasso)
    best_lambda <- cv_lasso$lambda.min

    # Extract coefficients at the best lambda
    coef(cv_lasso, s = "lambda.min")
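To pull out the retained variables programmatically, here is a small follow-up sketch (not in my original script; it uses the more conservative lambda.1se as an alternative to lambda.min, and the variable names come from the example formula above):

    # Variables with nonzero coefficients at the more conservative lambda.1se
    coefs <- coef(cv_lasso, s = "lambda.1se")
    kept  <- rownames(coefs)[as.vector(coefs != 0)]
    setdiff(kept, "(Intercept)")  # drop the intercept from the list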

The results were:

    > coef(cv_lasso, s = "lambda.min")
    15 x 1 sparse Matrix of class "dgCMatrix"
                            s0
    (Intercept)  -4.3742763020
    habB          0.3448104589
    habC         -0.0012576147
    habD          0.5056425623
    habE         -0.1176552784
    SeasonSpring -0.0660149736
    SeasonSummer  .
    SeasonWinter  .
    A             1.3288390486
    B             0.4343750355
    C            -0.0302584569
    D             0.3157774736
    E             .
    F            -0.0003280062
    G             0.0563677638

I am aware that glmnet ignores the influence of my random effects and that other packages such as glmmLasso may be more appropriate; however, with my large dataset that approach is too computationally intensive. I have therefore decided to compare a null model, a full model (with all variables), and a lasso-shrunk model (with variables E and F removed) using AIC, and to choose the best-fitting model on that basis (the comparison is sketched after the table below).

            df      AIC
    Null     2 57723.75
    Full    17 34059.04
    Shrunk  15 36452.12
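A minimal sketch of how this comparison was set up (I fit the GLMMs with glmmTMB, using a random intercept for bird ID; the fixed-effect formulas below use the placeholder terms from the example above):

    library(glmmTMB)

    null_fit   <- glmmTMB(Use ~ 1 + (1 | ID), data = df, family = binomial)
    full_fit   <- glmmTMB(Use ~ Habitat + Season + A + B + C + D + E + F + G +
                            (1 | ID), data = df, family = binomial)
    # Lasso-shrunk model: variables E and F dropped
    shrunk_fit <- glmmTMB(Use ~ Habitat + Season + A + B + C + D + G +
                            (1 | ID), data = df, family = binomial)

    AIC(null_fit, full_fit, shrunk_fit)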

However, I have noticed that the full model is nearly always the best-fitting model based on AIC, and I don't fully understand why this may be. After variable/model selection I am then using ggpredict on the final GLMM to summarise the predicted probability of use (a sketch of the call is below).
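For illustration, the kind of ggpredict call I mean (from the ggeffects package; "A" is a placeholder for one of the example predictors):

    library(ggeffects)

    # Predicted probability of use across predictor A, with other fixed
    # effects held at typical values (random effects at zero by default)
    pred <- ggpredict(full_fit, terms = "A")
    plot(pred)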

I suppose I am therefore doing a more "exploratory" analysis to see which variables influence my study population, and I want to carry out variable selection to find the most important variables influencing perch site selection (while avoiding any "data mining").

I was hoping someone could provide some guidance on whether my method is appropriate and, if not, on the best way to go about variable selection. I have spent time looking at the published literature but have struggled to fully understand the different methods, not having come from a maths/stats background.

  • Welcome. Just a thought here. Depending on the structure of your random effects, there may already be some regularization in your mixed model that lasso can't improve on. (Commented Aug 27 at 16:01)
  • "However, I have noticed that the full model is nearly always the best fitting model based on AIC": with 40k observations and only 10 variables, it is not really surprising that any variable selection method based on prediction performance/fit would keep them all in the model, particularly AIC with its lighter complexity penalty. How did you get glmnet to produce a model with nonzero coefs? Using CV to pick the penalty strength, or just manually adjusting it until you got the desired number of variables? (Commented Aug 27 at 17:52)
  • Thank you for your comments, they've given me lots to think about. @NathanWycoff, I used CV to pick the penalty strength for the lasso and then used glmmTMB to produce my GLMM. Is it better to stick with one method then, so choose either the lasso and highlight the potential issues around it being for GLMs, or switch to something like the MuMIn package and use AIC? (Commented Aug 27 at 19:41)
  • @CjC Oh interesting; the CV-based lasso really did pick only some strict subset of parameters? That's surprising to me. Anyhow, you'll have to tell us just a bit more about your goals for us to give you a definite answer. Why do you want to do variable selection? In order to get a simpler model containing fewer terms? If so, this is in tension with the way you are currently performing variable selection, which is instead geared towards maximal predictive accuracy (glmnet+CV) or based on searching for a "true" model (AIC). (Commented Aug 27 at 20:38)
  • @NathanWycoff, thank you again for your comments. I have updated my question above with some further information and code which hopefully explains my research aims a bit further. Ultimately, I want to find the most important variables for site selection and remove any unnecessary variables (especially given the large dataset). Currently I am just using ggpredict after model fitting, but the results may be used in the future to create a predictive map of suitable perch sites in the study area. (Commented Aug 28 at 8:42)
