
I am trying to simulate data based on two random effects and no fixed effects. The simulated data table contains nP Products, and within each Product there are nB Batches. Each Batch has nR Replicates, so I have a balanced design. The standard deviation over all Batches is sd_B. However, this standard deviation itself varies over the different Products with sd_ranP. From my understanding, the random effects for Batch and Product are nested and should therefore be estimated by a nested linear mixed model. I tried to model the simulated data using the lme4 package in GNU R. However, I encounter singularity issues and wonder whether I have understood nested random effects correctly or whether there is a mistake in my simulation.

What I tried so far is the following:

First, I define the number of Products, Batches and Replicates together with the corresponding standard deviations.

nP <- 25       #choose number of Products
nB <- 25       #choose number of Batches
nR <- 5        #choose number of Replicates
sd_B <- 100    #standard deviation for Batch
sd_ranP <- 10  #random variability of the Batch SD across the different Products
sdRes <- 0.2   #residual standard deviation

Next, I sample the standard deviation of the Batches for each Product. Since the standard deviation of the Batches changes slightly from one Product to the next, I account for this with the nested random term sd_ranP. (For simplicity, I assumed that the sampled sdB is also normally distributed, fully aware that standard deviations cannot be negative, so a log-normal sampling would perhaps be more correct here. However, I have chosen values for sd_B and sd_ranP such that no negative values of sdB occur.)

#sample the standard deviation of the batches by considering a nested random effect over the products
sdB <- rnorm(n = nP, mean = sd_B, sd = sd_ranP)

Afterwards, I sample the Batch effects from sdB, add residual noise sdRes, which has the same standard deviation across all Products and Batches, and finally create a data frame from the resulting response vector y.

#sample from sdB
resB <- unlist(map(.x = sdB, ~rnorm(n = nB, mean = 0, sd = .x)))
#create y and add sdRes
y <- unlist(map2(.x = resB, .y = sdRes, ~rnorm(n = nR, mean = .x, sd = sdRes)))
#create a data frame
df <- data.frame(Product = as.factor(rep(1:nP, each = nB*nR)),
                 Batch = as.factor(rep(1:(nB*nP), each = nR)),
                 y = y)

Using the seed set.seed(29072023), the simulated data visualized in ggplot2 show a different Batch variability across the Products with a very small residual (within-Batch) variability. This is exactly what I would expect.
ggplot of the simulated data
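For reference, a plot like the one described could be produced along these lines (the aesthetic choices here are my own assumptions, not the original code):

```r
library(ggplot2)

# Plot all replicates per Batch, panelled by Product, so that the
# Batch-to-Batch spread within each Product becomes visible
ggplot(df, aes(x = Batch, y = y)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Product, scales = "free_x") +
  theme(axis.text.x = element_blank())
```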

Finally, I tried to fit the following model:

model <- lmer(y ~ 1 + (1|Product/Batch), data = df)

expecting estimated random-effect standard deviations of around 100 for Batch:Product, around 10 for Product, and around 0.2 for the Residual.

The interesting point is that when I do not set a seed, in some cases I get exactly these values, but in around half of the cases I end up with a singularity issue and a random-effect estimate for Product very close or equal to 0. My first thought was that the two random effects might be confounded; however, in that case I would expect a singularity issue in all cases, not only around 50% of the time.
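As a side note, such boundary fits can be detected programmatically with lme4's isSingular(), which checks whether any estimated variance component is (numerically) on the boundary of the parameter space; a minimal sketch:

```r
library(lme4)

model <- lmer(y ~ 1 + (1 | Product/Batch), data = df)

# TRUE if some variance component is estimated as (numerically) zero,
# i.e. the fit sits on the boundary of the parameter space
isSingular(model, tol = 1e-4)
```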

To investigate this observation more closely, I repeated the sampling process and model fitting 1000 times. Each time I saved the estimated random-effect value for Product in a vector and created a histogram of the results. It turned out that in around 58% of the cases the estimated random effect for Product was below 0.1. However, if these values are neglected, the remaining 42% show exactly the distribution I would expect (the mean of the distribution is approximately 10).

Histogram of the estimated random effect for Product based on 1000 repetitions

I hope someone can shed a bit light into my observation ;-)

Here is the code for the 1000 sampling repetitions.

#load packages
library(tidyverse)
library(lme4)

#output of the for loop
outm <- NULL

nP <- 25       #choose number of Products
nB <- 25       #choose number of Batches
nR <- 5        #choose number of Replicates
sd_B <- 100    #standard deviation for Batch
sd_ranP <- 10  #random variability of the Batch SD across the different Products
sdRes <- 0.2   #residual standard deviation

set.seed(29072023)
for(i in 1:1000){
  #sample the standard deviation of the batches by considering a nested random effect over the products
  sdB <- rnorm(n = nP, mean = sd_B, sd = sd_ranP)
  #sample from sdB
  resB <- unlist(map(.x = sdB, ~rnorm(n = nB, mean = 0, sd = .x)))
  #create y and add sdRes
  y <- unlist(map2(.x = resB, .y = sdRes, ~rnorm(n = nR, mean = .x, sd = sdRes)))
  #create a data frame
  df <- data.frame(Product = as.factor(rep(1:nP, each = nB*nR)),
                   Batch = as.factor(rep(1:(nB*nP), each = nR)),
                   y = y)
  model <- lmer(y ~ 1 + (1|Product/Batch), data = df)
  outm <- rbind(outm, as.data.frame(summary(model)[["varcor"]]))
}
outm$sdcor <- as.numeric(outm$sdcor)
test <- outm %>% dplyr::filter(grp == "Product")
hist(test$sdcor)
  • Welcome to cv, Marco! Did you do the simulation study in order to understand mixed models, or is there a real application behind it? I am asking because linear mixed models assume random effects with fixed variance. Estimation may perhaps be quite robust, but in 58% of cases it breaks down — perhaps you reached a point where heteroscedasticity matters. Nice study, btw :-) Have you tried with fixed variance? Commented Jul 31, 2023 at 10:53
  • Thanks @Ute! There is a real application behind this simulation study, and I am trying to figure out whether LMMs are the right tool to address the problem. Since the heteroscedasticity we observe here is not a function of another variable (e.g. variance increasing with higher concentration) but a randomly changing variance around a "mean variance", I hoped that nested LMMs could be a promising tool. I also tried crossed random effects (both variance components fixed and independent of each other) and didn't run into any issues. However, this does not map to the situation I have in my real data. Commented Jul 31, 2023 at 11:51
  • Do you have estimates ("guestimates") for the distribution of variances? You could run your experiment with different variance distributions and check where it gives these results. Commented Jul 31, 2023 at 12:25
  • Unfortunately not. As in the simulation, I have several Products, each with nB Batches and nR Replicates. The Batches have one common variance. However, this variance differs randomly between the Products, and this difference is not caused by the uncertainty in the Batch variance estimation. In other words, even if I increased the number of Batches towards infinity, this random variability of the Batch variance across the Products would still be present, and I have to account for it. Commented Jul 31, 2023 at 12:47
  • I could also be wrong. You modelled the sd as following an N(100, 10) distribution, so there is not really much variation (only about ±20%). On the other hand, sdRes is very small in comparison to sdB, so it is as if you had replicates of the same value in each Batch. Is this realistic? (Side question: do you already have data, or are you still planning?) Commented Jul 31, 2023 at 14:15

1 Answer


The model that you simulated from is quite different from the model that you told lmer to fit. Your simulated data have no real random effect for Product, so in 58% of the cases, the estimated values correspond to the true values. Since you fit a model that postulates such a random effect, lmer will try to make up for such an effect, and in 42% of the cases, it is "lucky".

Simulated model and fitted model

You simulate from the model $$ M_{\text{simulated}}:\quad y_i = u_{\text{batch}(i)} +\varepsilon_i $$ where $$ \begin{aligned} \varepsilon_i&\sim N(0, 0.2^2),\\ u_{\text{batch}(i)}&\sim N(0, s^2_{\text{batch}(i)}),\\ s_{\text{batch}(i)} &= 100 + t_{\text{product}(i)},\\ t_{\text{product}(i)} &\sim N(0,10^2). \end{aligned} $$ Here $\text{batch}(i)$ and $\text{product}(i)$ give the index of the batch and product of item $i$. This is a quite complicated covariance structure; the standard deviations $s_{\text{batch}}$, however, do not vary much (by about ±20%). Therefore the simulated model can be approximated by a much simpler model, $$ M_{\text{simple}}:\quad y_i = u_{\text{batch}(i)} +\varepsilon_i, \quad \varepsilon_i\sim N(0,0.2^2),\ u_{\text{batch}(i)}\sim N(0, 100^2). $$ Fit this model to the simulated data via lmer(y ~ 1 + (1|Batch), data = df), and you will get very stable results for both the batch variance and the residual variance (there is no product variance involved).
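To verify this, one could fit $M_{\text{simple}}$ and inspect the variance components directly (a sketch using the df from the question):

```r
library(lme4)

# Batch-only model: no Product term at all
m_simple <- lmer(y ~ 1 + (1 | Batch), data = df)

# The estimated SDs should come out near 100 (Batch) and 0.2 (Residual)
VarCorr(m_simple)
```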

The model that you fit by lmer(y ~ 1 + (1|Product/Batch), data = df) translates into $$M_{\text{fitted}}:\quad y_i = v_{\text{product}(i)} + u_{\text{batch}(i)}+\varepsilon_i, $$ where $$ \begin{aligned} \varepsilon_i&\sim N(0, 0.2^2),\\ u_{\text{batch}(i)}&\sim N(0, \sigma^2_{\text{batch}}),\\ v_{\text{product}(i)} &\sim N(0,\sigma^2_{\text{product}}). \end{aligned} $$ In 58% of the cases, lmer found that model $M_{\text{simple}}$ without the term $v_{\text{product}}$ is the right fit. The fitted model does not account for heteroscedasticity, but you encoded the nesting correctly.

Fitting heteroskedastic models

Usually, you would use the function lme from the package nlme to fit a heteroskedastic model. The syntax of lmer and lme differs; @BenBolker gives a hack to use lme4 for particular heteroskedastic models in this thread on stackoverflow. Stack Overflow would also be the right place to post questions about encoding a model in lme. The model should then be described in mathematical terms, or verbally, not only as simulation code ;-)
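For illustration only, a heteroskedastic fit in nlme could look roughly as follows. Note that varIdent models stratum-specific residual variances, which is not quite the structure simulated here (Product-specific random-effect variances would need something like the lme4 hack mentioned above), so treat this purely as a sketch of the syntax:

```r
library(nlme)

# Nested random intercepts plus a Product-specific residual variance.
# varIdent(form = ~ 1 | Product) lets the residual SD differ by Product;
# it does NOT make the Batch random-effect variance vary by Product.
m_het <- lme(fixed   = y ~ 1,
             random  = ~ 1 | Product/Batch,
             weights = varIdent(form = ~ 1 | Product),
             data    = df)
summary(m_het)
```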

When heteroskedasticity in your real data is not very pronounced, you can safely fit a homoskedastic mixed model. But what does "not very pronounced" mean? When comparing variances (or standard deviations), look at the ratio rather than the difference. You can always run a test of equal variances. A good way to check whether heteroskedasticity is harmful is a simulation like the one you have done. Here is an interesting discussion on cv about violation of homoskedasticity and the use of Bartlett's or Levene's test of homoskedasticity.
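As a rough check on the Batch-level heteroskedasticity, one could compare the spread of the Batch means across Products with Bartlett's test; a base-R sketch, using the df from the question:

```r
# One mean per Batch, keeping the Product label
batch_means <- aggregate(y ~ Product + Batch, data = df, FUN = mean)

# Test whether the variance of the Batch means differs between Products
bartlett.test(y ~ Product, data = batch_means)
```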

