8
$\begingroup$

For my bachelor's thesis, I’m investigating the effect of voles and mulch on soil infiltration and saturated hydraulic conductivity (Ksat). I want to test the following three hypotheses:

  1. Vole activity significantly increases soil infiltration.

  2. Vole activity significantly increases saturated hydraulic conductivity.

  3. The combination of mulch and vole activity has a synergistic effect on infiltration and/or Ksat.

To test this, I have 16 plots in total:

8 plots with vole activity (4 of them with mulch, 4 without)

8 control plots without vole activity (4 with mulch, 4 without)

For each plot, I will collect:

One infiltration measurement

One Ksat measurement

Soil temperature

Soil moisture

So for each outcome variable (infiltration or Ksat), I’ll have 16 observations and 4 predictors: vole activity (yes/no), mulch (yes/no), temperature, and moisture.

My professor suggested using linear regression to test the hypotheses. However, I’m concerned because I’ve read that you need at least 10 observations per predictor variable to get reliable estimates. So, my questions are:

Is it valid to perform linear regression with 4 predictors and only 16 samples?

Should I consider removing covariates (i.e. soil temperature/moisture) if they are not significantly correlated with the outcome variable?

Are there better methods than regression? It’s also possible that I’ll collect a second set of measurements if the variability of my samples for infiltration and/or Ksat is too high.

I’m relatively new to statistics, so any guidance or suggestions for how to approach this would be highly appreciated! I'm going to do this all in R.

My table will look like this:

enter image description here

$\endgroup$

2 Answers

8
$\begingroup$

Welcome to CV!

You are correct to be concerned about your small sample size, but in your particular case (thanks for including the data), you got lucky.

You say that “[you]’ve read that you need at least 10 observations per predictor variable to get reliable estimates”. This is a rule of thumb, which, like all rules of thumb, is generally good advice but certainly has exceptions.

The first thing to do, as always in statistics, is to plot your data (as we know, a picture is worth a thousand words). This is my attempt at it:

Interval Plots

The blue dots are the means, the grey dots are individual observations, and the bars are 95% CIs for the means. And yes, I also plotted your covariates (Moisture and Temperature) as if they were outcomes; more on this later.

This tells us several things:

  1. Your tests of hypotheses are very likely to be (very!) significant; the CIs of the means for Infiltration and Ksat show no overlap. Because your effects are quite large, this greatly alleviates your issue with small sample size (i.e. it does not really matter which test you run; it will be significant).
  2. Moisture and Temperature seem to be highly associated with your predictors, and behave just like your outcomes (except for the sign of Temperature). And that is as we should expect; if Infiltration is higher due to Voles/Mulch, then we would expect Moisture content to also be higher. And if Mulch covers the soil, we would expect the Temperature to be lower. You have a case of multicollinearity; this is not a huge problem (see e.g. here) but with your small sample size it is more bothersome. So I would not use them as covariates, particularly since your hypotheses do not depend on them (and instead you could use them as other outcome variables).
  3. The small sample size can be easily perceived. You basically have only 4 clusters of observations for your model (with 4 observations per cluster). This indeed breaks the rule of thumb, but in your case, it is not too critical, since your effects are very large.
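
As a rough sketch of how an interval plot like this could be produced in base R: the data below are simulated, and the column names (`vole`, `mulch`, `infiltration`) and the helper `mean_ci` are placeholders I made up for illustration, not taken from your table.

```r
# Simulated stand-in data: 16 plots in a balanced 2 x 2 design (hypothetical values)
set.seed(1)
d <- expand.grid(rep = 1:4, vole = c("no", "yes"), mulch = c("no", "yes"))
d$infiltration <- 10 + 5 * (d$vole == "yes") + 3 * (d$mulch == "yes") +
  rnorm(nrow(d), sd = 1)

# Mean and t-based 95% confidence interval for one group of measurements
mean_ci <- function(x) {
  m <- mean(x)
  half <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(mean = m, lower = m - half, upper = m + half)
}

# One group per vole x mulch combination (4 groups of 4 observations)
groups <- split(d$infiltration, list(vole = d$vole, mulch = d$mulch))
ci_table <- t(sapply(groups, mean_ci))
print(round(ci_table, 2))

# Grey points: individual observations; blue points and bars: means and 95% CIs
stripchart(groups, vertical = TRUE, method = "jitter", pch = 16,
           col = "grey50", ylab = "Infiltration (simulated)")
idx <- seq_along(groups)
points(idx, ci_table[, "mean"], pch = 16, col = "blue")
arrows(idx, ci_table[, "lower"], idx, ci_table[, "upper"],
       angle = 90, code = 3, length = 0.05, col = "blue")
```

With only 4 observations per group, the t multiplier is large (3 degrees of freedom), so intervals that still do not overlap point to a big effect relative to the within-group scatter.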

Which tests to use?
Regression? Sure, even with your small sample size, because of the large effect. You could do repeated measures of each plot (as you suggested); even better would be to have more plots (but likely not practical). But, again looking at the plots, the variance between measures at each combination of your predictors is so small compared to the variance introduced by the predictors (Mulch, Vole) that it will not make a large difference.
You could use t-tests or one-way ANOVA. But these are both special cases of regression, and on their own they do not estimate the interaction term.
You could use a DOE (Design of Experiment, aka Factorial ANOVA), which would give you information about the interaction term, but this is just a regression.

So yes, follow your professor’s advice.
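
The claim that factorial ANOVA is "just a regression" can be checked directly in R. This is a minimal sketch on simulated data (the values and the column name `ksat` are assumptions for illustration). It fits the same 2 x 2 model with `lm()` and `aov()` and compares the outputs.

```r
# Simulated 2 x 2 factorial data (hypothetical values, 4 plots per cell)
set.seed(42)
d <- expand.grid(rep = 1:4, vole = c("no", "yes"), mulch = c("no", "yes"))
d$ksat <- 20 + 8 * (d$vole == "yes") + 4 * (d$mulch == "yes") +
  3 * (d$vole == "yes" & d$mulch == "yes") + rnorm(nrow(d), sd = 2)

# Same formula, two interfaces
fit_lm  <- lm(ksat ~ vole * mulch, data = d)
fit_aov <- aov(ksat ~ vole * mulch, data = d)

# The fitted coefficients are identical
same_coef <- all.equal(coef(fit_lm), coef(fit_aov))

# anova() on the lm fit gives the same F table as summary() on the aov fit
tab_lm  <- anova(fit_lm)
tab_aov <- summary(fit_aov)[[1]]
same_F <- all.equal(tab_lm[["F value"]], tab_aov[["F value"]])

# The interaction is a single coefficient, so its t statistic in
# summary(fit_lm) squared equals its F statistic in the ANOVA table
t_int <- coef(summary(fit_lm))["voleyes:mulchyes", "t value"]
F_int <- tab_lm["vole:mulch", "F value"]

print(c(same_coef = same_coef, same_F = same_F))
print(c(t_squared = t_int^2, F = F_int))
```

The two `summary()` printouts look different because they answer different questions. `summary(lm)` shows a t-test per coefficient, and with R's default treatment contrasts the `voleyes` row is the vole effect in plots without mulch only. `summary(aov)` shows one F-test per model term. The underlying fit is the same.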

One last comment. You will have 2 regressions (or maybe 4 if you decide to also regress Moisture and Temperature). The question then is “Should you use a multiple comparison correction?”. Note that opinions are divided on this (see e.g. here or here on CV). You at least should address the topic, no matter how you answer it (and again, your large effects will be your friends here).
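
If you do decide to correct for multiple comparisons, base R's `p.adjust()` covers the common methods. The p-values below are invented purely to show the mechanics; they are not results from your data.

```r
# Invented p-values for the same term tested on four outcomes
p_raw <- c(infiltration = 0.004, ksat = 0.012,
           moisture = 0.030, temperature = 0.045)

# Holm is never less powerful than Bonferroni and controls the same error rate
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(round(cbind(raw = p_raw, holm = p_holm, bonferroni = p_bonf), 3))
```

Whichever way you go, state in the thesis how many tests were run and which correction (if any) was applied.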

$\endgroup$
5
  • $\begingroup$ Thanks for your reply, and sorry for the delay. I got new input from my professor: I have a 2×2 factorial design (vole × mulch). First, I need to check that variability in Infiltration and Ksat isn’t too high within groups, using plots. If it is, I’ll resample all plots. Also, if visual differences seem strong but aren’t statistically significant, I’ll resample. I tried factorial ANOVA and regression, but summary() outputs differ (because you said I could use either regression or anova). $\endgroup$ Commented Jul 30 at 18:45
  • $\begingroup$ @Faith, as I explained in my answer, and showed on the graphs, the variability within plots is very small compared to the variability between plots. And the visual differences are statistically significant (I ran the tests :-). So no need to resample. Yes, you have a 2x2 factorial design, and you can use the tools for that; if that is what your professor prefers, do it; it is just a regression, reframed, and will give you the answers you seek. If you frame the regression model correctly, factorial ANOVA and regression will give you the same exact numerical values...ctd $\endgroup$ Commented Jul 30 at 20:17
  • $\begingroup$ ...ctd. Depending on the software you use, the formats may be different, they may include different metrics, but they will share most metrics, and the numerical results will be identical. $\endgroup$ Commented Jul 30 at 20:17
  • $\begingroup$ Okay, thank you for your answer! :) I know that these data show only small variability within each group, but unfortunately, they are made-up data and not actual samples yet (I still need to collect those). When I run aov() and lm() in R to check for significant coefficients, I get different outputs when I use summary(aov()) versus summary(lm()). Some people say (factorial) ANOVA and (multiple) regression are different (e.g. due to degrees of freedom), while others say they’re essentially the same. I’m a bit confused. $\endgroup$ Commented Jul 31 at 7:10
  • $\begingroup$ @Faith, I definitely got the feeling that this was "fake" data, because the residual plots and all the summary results (and the raw data itself) were just "too good to be true" :-). Wrt lm and aov, the documentation for aov says "This provides a wrapper to lm for fitting linear models to balanced or unbalanced experimental designs"; so you should get the same results?? (but you need to specify the exact same model in both) I do not usually work in R (fwiw, my software gave the same results), but you may want to have a look at DoE.base (R package for DoE). $\endgroup$ Commented Jul 31 at 16:51
6
$\begingroup$

I’m concerned because I’ve read that you need at least 10 observations per predictor variable to get reliable estimates.

It's good to be concerned. The danger with too few observations is that you overfit the data in a way that your model doesn't extend reliably to new cases.

The limitation isn't strictly about the number of "predictor variables." The concern is about how many coefficients you need to estimate from the data. In your case, evaluating "[t]he combination of mulch and vole activity" means examining an interaction term between mulch and vole, adding another coefficient to your model. That's 5 coefficients if you include temperature and moisture.
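
One way to count the coefficients before collecting any data is to build the design matrix for the planned model. This is only a sketch: the covariate values are random placeholders, and the column names are assumptions.

```r
# Placeholder design: 16 plots, with random stand-ins for the two covariates
set.seed(7)
d <- expand.grid(rep = 1:4, vole = c("no", "yes"), mulch = c("no", "yes"))
d$temperature <- rnorm(nrow(d), mean = 15, sd = 2)
d$moisture    <- rnorm(nrow(d), mean = 30, sd = 5)

# Design matrix for main effects, the vole x mulch interaction, and both covariates
X <- model.matrix(~ vole * mulch + temperature + moisture, data = d)

print(colnames(X))
n_coef <- ncol(X)              # includes the intercept column
resid_df <- nrow(X) - ncol(X)  # degrees of freedom left for the error estimate
print(c(coefficients = n_coef, residual_df = resid_df))
```

The matrix has one column per estimated coefficient, so the count here includes the intercept along with the five terms mentioned above.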

Is it valid to perform linear regression with 4 predictors and only 16 samples?

Rules of thumb like 10 (or 15) observations per coefficient are just that: rules of thumb that might not apply in any particular circumstance. Frank Harrell explains in Section 4.4 of Regression Modeling Strategies that these guidelines:

Assume typical problem in medicine, epidemiology, and the social sciences in which the signal:noise ratio is small (higher ratios allow for more aggressive modeling).

If the agreement among plots within each combination of mulch and vole is good enough, you might be OK.

Should I consider removing covariates (i.e. soil temperature/moisture) if they are not significantly correlated with the outcome variable?

It's best to pre-specify the model and not to use observed associations between predictors and outcomes to make that type of decision. See Regression Modeling Strategies for guidance, particularly Chapters 2 and 4.

In your case, I'd be worried anyway about including temperature and moisture in your model, as I suspect that those will be affected by mulch and vole and might be mediating the effects of mulch and vole. If you include them in your model, your estimates of the effects of mulch and vole would only be those that aren't mediated by temperature and moisture.
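
A small simulation can illustrate the mediation concern. Everything here is assumed for illustration: the causal structure (vole raises moisture, moisture raises infiltration), the effect sizes, and a sample far larger than your 16 plots so that the pattern is not hidden by noise.

```r
# Simulated data under an ASSUMED causal structure; not the thesis design
set.seed(123)
n_per_cell <- 200
d <- expand.grid(rep = seq_len(n_per_cell), vole = c(0, 1), mulch = c(0, 1))

# Vole activity raises moisture by 5 units (assumed)
d$moisture <- 30 + 5 * d$vole + rnorm(nrow(d), sd = 1)
# Infiltration depends on vole directly (2) and through moisture (0.6 per unit)
d$infiltration <- 10 + 2 * d$vole + 0.6 * d$moisture + rnorm(nrow(d), sd = 1)

# Without the mediator: estimates the total effect, 2 + 0.6 * 5 = 5
total_effect <- coef(lm(infiltration ~ vole + mulch, data = d))[["vole"]]
# With the mediator: estimates only the direct part, 2
direct_effect <- coef(lm(infiltration ~ vole + mulch + moisture, data = d))[["vole"]]

print(round(c(total = total_effect, direct = direct_effect), 2))
```

If the hypothesis is about the overall effect of voles on infiltration, the model without the mediator is the one that targets it.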

Are there better methods than regression?

Several other methods that you might have heard about (e.g., analysis of variance/covariance) are just special cases of regression models. There are "machine learning" methods that aren't regression, but they wouldn't seem to be very useful in your situation. They are mostly for cases with large numbers of cases and predictors where you want to learn associations from the data rather than to test pre-specified hypotheses like yours.

It’s also possible that I’ll collect a second set of measurements if the variability of my samples for infiltration and/or Ksat is too high.

More data can be helpful if they aren't too expensive to collect. You have to be careful, however, as making duplicate measurements for each plot (called "technical replicates") isn't the same as doubling the number of plots (called "biological replicates"). From the perspective of your model, you will still have only 4 plots per combination of mulch and vole, although (one hopes) with a more precise estimate of the true value within each plot.
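
To make the replicate distinction concrete, here is a sketch with simulated data in which each of the 16 plots is measured twice. The names (`plot_id`, `true_value`) and all numbers are assumptions for illustration. Averaging within plots before fitting keeps the plot as the unit of analysis.

```r
# Simulated plots: 4 per vole x mulch cell, each with its own underlying value
set.seed(99)
plots <- expand.grid(rep = 1:4, vole = c("no", "yes"), mulch = c("no", "yes"))
plots$plot_id <- factor(seq_len(nrow(plots)))
plots$true_value <- 10 + 5 * (plots$vole == "yes") +
  rnorm(nrow(plots), sd = 1.5)  # plot-to-plot variation

# Two technical replicates per plot, with measurement noise on top
long <- plots[rep(seq_len(nrow(plots)), each = 2), ]
long$infiltration <- long$true_value + rnorm(nrow(long), sd = 0.5)

# Naive fit treats all 32 rows as independent observations
naive <- lm(infiltration ~ vole * mulch, data = long)

# Averaging replicates first gives one value per plot
plot_means <- aggregate(infiltration ~ plot_id + vole + mulch,
                        data = long, FUN = mean)
by_plot <- lm(infiltration ~ vole * mulch, data = plot_means)

print(c(naive_df = df.residual(naive), by_plot_df = df.residual(by_plot)))
```

The naive fit claims more residual degrees of freedom than the design supports, which makes its standard errors and p-values too optimistic. A mixed model with a random intercept per plot is another option, but for a balanced design the per-plot averages are the simplest route.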

You might find a Technical Perspective by the Pollards, Empowering statistical methods for cellular and molecular biologists, Molecular Biology of the Cell 30: 1359-1368 (2019), helpful. Although it's written from the perspective of cellular and molecular biology, it's an approachable summary of general principles for biological study design and analysis.

$\endgroup$
1
  • 3
    $\begingroup$ good answer; just a nit, but would it not be 6 coefficients if you count the intercept (aka constant term)? $\endgroup$ Commented Jul 27 at 16:55
