Accidentally including the same variable twice in a regression model?

Question

I have this mixed effects GAM regression in R (mgcv):

library(mgcv) gam_beta <- gam( y ~ te(time, x1) + te(time, x2) + s(time, by = city) + s(city, bs = "re"), data = my_data, method = "REML", family = betar(link = "logit") )

I tried to write the equation for this model (based on my understanding of tensor product functions te in `mgcv https://www.rdocumentation.org/packages/mgcv/versions/1.9-1/topics/te):

$$ y_{ij} \sim \text{Beta}(\mu_{ij}\phi, (1-\mu_{ij})\phi) $$ $$\mu_{ij} = \frac{1}{1+e^{-\eta_{ij}}}$$ $$ \text{logit}(\mu_{ij}) = \eta_{ij}= \beta_0 + f_{12}(time_{ij}, x1_{ij}) + f_{34}(time_{ij}, x2_{ij}) + h_i(time_{ij}) + b_i $$

I tried to expand this a bit more:

$$ \eta_{ij} = \underbrace{\sum_{k=1}^{K_1} \hat{\gamma}_{1k}f_{1k}(time_{ij}) + \sum_{l=1}^{L_1} \hat{\gamma}_{2l}g_{1l}(x1_{ij}) + \sum_{k=1}^{K_1}\sum_{l=1}^{L_1} \hat{\gamma}_{3kl}f_{1k}(time_{ij})g_{1l}(x1_{ij})}_{\text{first te(): time and x1 terms}} + $$

$$ \underbrace{\sum_{m=1}^{M_1} \hat{\gamma}_{4m}f_{2m}(time_{ij}) + \sum_{n=1}^{N_1} \hat{\gamma}_{5n}g_{2n}(x2_{ij}) + \sum_{m=1}^{M_2}\sum_{n=1}^{N_2} \hat{\gamma}_{6mn}f_{2m}(time_{ij})g_{2n}(x2_{ij})}_{\text{second te(): time and x2 terms}} + $$

$$ \underbrace{\sum_{p=1}^P \hat{\alpha}_{ip}h_p(time_{ij})}_{\text{city-specific smooth}} + \underbrace{\hat{b}_i}_{\text{random effect}} $$

Here is my question: I know that the te function in mgcv` includes both the main effects and the interaction effects (GAM Regression: Interactions vs Main Effects?). But won't this mean that there is a risk of a certain term being double counted and appearing twice in the GAM equation?

For example, I have te(time,x1) and te(time,x2). This is because I wanted to include time interactions with covariates x1 and x2 in my model. But will this result in the main effects of time being included twice? Or does mgcv recognize this and only include this once?

I think that the s function in mgcv only includes the term once? e.g. s(time, city) vs te(time,x1) ... I think te(time,x1) includes the main effects of time, the main effects of x1 and the interaction of time and x1.... whereas the s function does not do this. — stats_noob
– stats_noob, Commented Dec 16, 2024 at 8:21
That's (your comment) is not correct; te() includes the main effects and interaction — Gavin Simpson
– Gavin Simpson, Commented Dec 16, 2024 at 10:50
Hi Gavin, thanks for the feedback. Can you please write the mathematical equation corresponding to " gam_beta <- gam( y ~ te(time, x1) + te(time, x2) + s(time, by = city) + s(city, bs = "re"), data = my_data, method = "REML", family = betar(link = "logit") )" if you have time? — stats_noob
– stats_noob, Commented Dec 16, 2024 at 15:09

Nick Cox · Accepted Answer · 2024-12-16 17:14:58Z

On its own, your model is not identifiable: the same terms are included in the model twice. In your example they are included at least 2 times and the same basis is included 2 + nlevels(city) times because of the factor by smooth:

te(time, x1) + te(time, x2) + s(time, by = city)

Having said that, mgcv does try to identify terms that are linearly dependent, so, often, you can estimate the model's parameters, and this is largely due to the penalization in the parameters and identifiability constraints. Usually there is a price to pay, however, in that the standard errors of the terms are larger than they need be.

I'm not sure why writing out the mathematical equation for the model would help here. What you need to understand is that the basis for te(time, x1) and that for te(time, x2) span the space of functions that would represent the main effects of time. So yes, there is a risk that a model of the form you describe contains the same terms multiple times. While "double accounting" is not the issue, this is a problem because where should the model ascribe variation due to time? Hence the lack of identifiability.

As you've been told before, the correct way to do what you are doing is to use the ti() smooths, where the span of functions in the basis of a ti(x,z) term is orthogonal to the span of functions representing the smooth main effects of x and z. Additionally, I would suggest a different representation for your $f_{\mathtt{city}}(\mathtt{time}_i)$ effects, to also avoid the identifiability issues with the main effects of smooth of time:

ti(time) + ti(x1) + ti(x2) + ti(time, x1) + ti(time, x2) + s(time, city, bs = "sz")

You will want to set k on all of those smooths and choose the basis (using argument bs, except for the s() term, where you will need to use xt = list(bs = "foo"), replacing "foo" with the required basis type. This is because s() and ti() use different defaults for the basis type, bs, and the dimension of the basis, k.

1) My original R code had a random effect on city : s(city, bs = "re") . I see that your code, you have removed the random effects? is there any reason? is it ok for me to put random effects here? (I have a feeling that perhaps bs = "sz" is somehow a random effect)? — stats_noob
– stats_noob, Commented Dec 17, 2024 at 3:58
2) in your r code, you use ti(x1) and ti(x2). will there be a difference if instead I use s(x1) and s(x2)? I ask this question because ti() is supposed to be an interaction, but I only variable in each ti() ... does this mean interaction with 1? — stats_noob
– stats_noob, Commented Dec 17, 2024 at 3:58
@user_436830 Have you read this paper? peerj.com/articles/6876 — Roland
– Roland, Commented Dec 17, 2024 at 6:24
Hi Roland, thank you for the paper reference I will check it out. Do you have any opinions on these 2 comments I posted here? — stats_noob
– stats_noob, Commented Dec 17, 2024 at 14:28

Stack Exchange Network

Accidentally including the same variable twice in a regression model?

1 Answer 1

Linked

Hot Network Questions

Accidentally including the same variable twice in a regression model?

1 Answer 1

Linked

Related

Hot Network Questions