1
$\begingroup$

I am conducting my master's thesis on the temporal patterns of spore production in two specific species and the environmental drivers associated with these patterns. I began with a visual analysis of sporulation over time to identify possible trends. For one species, no clear temporal pattern emerged, whereas the other showed a distinct Gaussian-shaped curve peaking in the summer months.

My next step is to perform a statistical analysis to determine in which season we can expect higher sporulation values. Specifically, I plan to model sporulation ~ season + (1 | site) for both species to verify whether the patterns observed in the plots are supported statistically.

Following that, I intended to build an environmental model of the form sporulation ~ climatic variables + (1 | site). However, I have received feedback suggesting that time should be included, as it may act as a confounding variable influencing both the response and the predictors. I am uncertain about this, since my primary research question focuses on identifying which environmental predictors affect sporulation. Including time might remove part of the variability I am trying to explain, but excluding it might risk bias due to confounding. I am currently evaluating how best to handle this issue.
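
To make the confounding concern concrete, here is a minimal simulation sketch in Python (standard library only). Everything in it is an assumption for illustration, not the thesis data: a single climatic variable called `temp`, a yearly sine/cosine term standing in for "time", a true temperature effect of 0.5, and a hypothetical `direct_season_effect` that lets season influence sporulation through some route other than temperature. Site random intercepts are left out for brevity, so this is plain least squares rather than the mixed model described above.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def simulate(n_weeks=50, n_sites=6, seed=1, direct_season_effect=-3.0):
    """Weekly data where season drives both temperature and the response."""
    rng = random.Random(seed)
    temp, sin_t, cos_t, y = [], [], [], []
    for _site in range(n_sites):
        for week in range(n_weeks):
            phase = 2 * math.pi * week / 52
            s, c = math.sin(phase), math.cos(phase)
            t = 15 - 8 * c + rng.gauss(0, 2)  # assumed: temperature peaks mid-year
            # assumed: true temperature effect is 0.5, plus a direct seasonal effect
            resp = 0.5 * t + direct_season_effect * c + rng.gauss(0, 1)
            temp.append(t)
            sin_t.append(s)
            cos_t.append(c)
            y.append(resp)
    return temp, sin_t, cos_t, y


def compare(seed=1, direct_season_effect=-3.0):
    """Return the temperature slope without and with the seasonal terms."""
    temp, sin_t, cos_t, y = simulate(seed=seed, direct_season_effect=direct_season_effect)
    naive = ols([[1.0, t] for t in temp], y)[1]
    adjusted = ols([[1.0, t, s, c] for t, s, c in zip(temp, sin_t, cos_t)], y)[1]
    return naive, adjusted


naive, adjusted = compare()
print(f"temperature slope without season: {naive:.2f}")
print(f"temperature slope with season:    {adjusted:.2f}")
```

Under these assumptions the slope from the model without the seasonal terms absorbs part of the direct seasonal effect, while the adjusted slope stays close to the simulated 0.5. Calling `compare(direct_season_effect=0.0)` covers the opposite case, where season acts only through temperature. Whether either scenario resembles the real system is exactly the open question.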

$\endgroup$
8
  • 2
    $\begingroup$ Welcome to CV. Please tell us more. How many years of data do you have? Are the measurements daily, weekly, or something else? Also, looking at your data before building models can cause problems with p-values, CIs, and the like. Have you split your data into training and test sets? $\endgroup$ Commented Aug 8 at 9:34
  • 1
    $\begingroup$ The response variable, quantity, is measured in DNA copies/µL per day. Samples were taken at intervals of approximately 7 days, although the exact spacing varies. The dataset includes samples collected in 2022 from six sites, with around 50 samples per site on average. However, the number of samples is not equal across sites because of errors in the sampling process. Additionally, this is not a prediction task; it is about causal relationships, so I haven't split the data into train and test sets. $\endgroup$ Commented Aug 8 at 13:25
  • 1
    $\begingroup$ Train/test splitting is useful for your purposes, too, Lara. After all, when your model cannot accurately estimate the responses, to what extent can you trust its parameter estimates? $\endgroup$ Commented Aug 8 at 15:42
  • $\begingroup$ Adding to what @whuber said: train/test splitting helps avoid overfitting and the "sharpshooter problem" (i.e., the guy who first shot his rifle at a barn and then painted targets around the holes). I'm not sure what it has to do with prediction vs. not prediction. $\endgroup$ Commented Aug 8 at 16:00
  • $\begingroup$ Then how would you proceed with this? I have never used this kind of approach. Can I still use the model summary to identify which variables are associated with higher quantity values? $\endgroup$ Commented Aug 8 at 17:18
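
One way the train/test idea from the comments could be set up for data like this is sketched below in Python (standard library only). Holding out whole sites, rather than random rows, is an assumption on my part and not something the commenters prescribed; the thinking is that weekly samples from the same site are likely correlated, so a row-wise random split may leak information between the two sets. The helper name `leave_one_site_out`, the `site` key, and the toy records are all hypothetical.

```python
from collections import defaultdict


def leave_one_site_out(records, site_key="site"):
    """Yield (held_out_site, train, test), holding out all rows of one site per fold."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec[site_key]].append(rec)
    for held_out in sorted(by_site):
        test = by_site[held_out]
        train = [r for s, rows in by_site.items() if s != held_out for r in rows]
        yield held_out, train, test


# Toy records standing in for the real samples (hypothetical values).
toy_records = [
    {"site": site, "week": week, "quantity": 0.0}
    for site in ("A", "B", "C")
    for week in range(4)
]

for held_out, train, test in leave_one_site_out(toy_records):
    # Fit the model on `train` here, then compare its predictions with `test`.
    print(held_out, len(train), len(test))
```

With only six sites each fold is small, so the held-out error will be noisy; it serves as a rough check on whether the fitted relationships transfer to an unseen site, not as a precise estimate.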
