
In a model aimed at assessing the influence of land-use measures on ecosystem functioning, I have one log-transformed dependent variable (the ecosystem function) and 5 fixed-effect independent variables (1 continuous, 3 binary, 1 categorical).

I've fitted a mixed-effects model (with two nested random effects, temporal and spatial, that stay the same throughout) and now I am trying to find the best model. Initially I wanted to do this by manual stepwise elimination of insignificant variables, using the anova(model) command and then comparing models with likelihood-ratio tests. The first model looks like this:

    > model0 <- lme(dep_var ~ bin_1 + bin_2 + bin_3 + log(cont_1 + 1) + cat_1,
    +               random = ~1 | season/site, method = "ML", data = abvpp2)
    > anova(model0)
                    numDF denDF  F-value p-value
    (Intercept)         1   319 9384.037  <.0001
    bin_1               1    72    2.972  0.0890
    bin_2               1    72    0.007  0.9338
    bin_3               1    72   12.423  0.0007
    log(cont_1 + 1)     1    72    1.655  0.2023
    cat_1               2    72   29.382  <.0001

After dropping the least significant variable, bin_2:

    > model1 <- lme(dep_var ~ bin_1 + bin_3 + log(cont_1 + 1) + cat_1,
    +               random = ~1 | season/site, method = "ML", data = abvpp2)
    > anova(model1)
                    numDF denDF  F-value p-value
    (Intercept)         1   319 9257.012  <.0001
    bin_1               1    73    2.931  0.0911
    bin_3               1    73   12.260  0.0008
    log(cont_1 + 1)     1    73    1.616  0.2077
    cat_1               2    73   28.361  <.0001
    > anova(model0, model1)
           Model df      AIC      BIC    logLik   Test L.Ratio p-value
    model0     1 10 517.3610 557.2506 -248.6805
    model1     2  9 516.6558 552.5564 -249.3279 1 vs 2 1.29476  0.2552

Pretty straightforward up to here. But after dropping cont_1 next, I get this:

    > model2 <- lme(dep_var ~ bin_1 + bin_3 + cat_1,
    +               random = ~1 | season/site, method = "ML", data = abvpp2)
    > anova(model1, model2)
           Model df      AIC      BIC    logLik   Test L.Ratio p-value
    model1     1  9 516.6558 552.5564 -249.3279
    model2     2  8 522.7970 554.7087 -253.3985 1 vs 2 8.14122  0.0043

So cont_1 does seem to have some explanatory power, despite its insignificance in the anova. Apparently this is due to multicollinearity among the variables, since the order in which they are entered into the model changes the significances in the anova output drastically (sometimes insignificant variables even become significant when fitted AFTER another variable that usually has more explanatory power). I am now a little unsure about how to select my model. Even though some variables are correlated (r = -0.33 and r = 0.62), they measure different things, and dropping one of them beforehand would be hard to justify. I would be very grateful for ideas on model selection in LME models that deals with collinearity and avoids stepwise commands.
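One way to sidestep the order dependence you describe: anova.lme in nlme can report marginal (Type III-style) tests, where each term is assessed after adjusting for all the others, instead of the default sequential (Type I) tests shown above. A minimal sketch, assuming the variable and data-frame names from your question:

```r
library(nlme)

## Same model as model0 above (names assumed from the question)
model0 <- lme(dep_var ~ bin_1 + bin_2 + bin_3 + log(cont_1 + 1) + cat_1,
              random = ~1 | season/site, method = "ML", data = abvpp2)

## Sequential (default): each term is tested after only the terms before it,
## so the p-values depend on the order of terms in the formula
anova(model0, type = "sequential")

## Marginal: each term is tested after ALL other terms, so the output
## no longer changes when you reorder the formula
anova(model0, type = "marginal")
```

Under collinearity the marginal tests typically look weaker than the sequential ones, because correlated predictors share explanatory power that no single term gets credited with.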


1 Answer


Since you have a small number of variables, you could actually fit all possible models and choose the best one (by the AIC or BIC criterion). The leaps package in R does this when there are no random effects, and there is probably something out there that can do it with random effects as well. Worst case, you write your own function and loop your way through the $2^5$ options.
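The "loop your way through the $2^5$ options" idea can be sketched directly with nlme. This is a sketch only, assuming the data frame and variable names from the question (abvpp2, dep_var, bin_1, ..., cat_1); the MuMIn package's dredge() function automates the same kind of all-subsets search for mixed models.

```r
library(nlme)

## The five candidate fixed-effect terms from the question
terms <- c("bin_1", "bin_2", "bin_3", "log(cont_1 + 1)", "cat_1")

## One model per subset of terms (2^5 = 32, including the intercept-only model)
fits <- lapply(0:(2^length(terms) - 1), function(i) {
  keep <- terms[bitwAnd(i, 2^(seq_along(terms) - 1)) > 0]
  f <- reformulate(if (length(keep)) keep else "1", response = "dep_var")
  ## method = "ML" so the AICs are comparable across fixed-effect structures
  lme(f, random = ~1 | season/site, method = "ML", data = abvpp2)
})

## Rank by AIC (swap in BIC() for the BIC criterion)
aic <- sapply(fits, AIC)
summary(fits[[which.min(aic)]])
```

If you go this route, refit the chosen model with method = "REML" before interpreting the variance components, since ML variance estimates are biased downward.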

But as an old-fashioned statistician, I have to make the following points:

  • A choice based on expert knowledge of what actually matters in this situation is preferable to a data-driven answer.
  • If the purpose of the study is to determine which variables matter, and if you have control over the independent variables, it would be a good idea to select a balanced design so that the impact of each variable and its interactions can shine forth.
  • Thank you for the quick answer! I will try the leaps package. Indeed the purpose is to determine which variables (if any) matter, which is why I have a hard time relying solely on the AIC, since in some cases that means excluding variables with a significant F-statistic (even though mostly with coefficients not different from zero, judging by the summary). Unfortunately, a balanced design is not possible. — Commented Aug 6, 2015 at 13:08
