Skip to main content
7 of 8
added 64 characters in body
Marco Plebani
  • 1.3k
  • 1
  • 9
  • 26

Different ways of modelling interactions between continuous and categorical predictors in GAM

The following question builds on the discussion found on this page. Given a response variable y, a continuous explanatory variable x and a factor fac, it is possible to define a General Additive Model (GAM) with an interaction between x and fac using the argument by=. According to the help file ?gam.models in the R package mgcv, this can be accomplished as follows:

gam1 <- gam(y ~ fac +s(x, by = fac), ...) 

@GavinSimpson here suggests a different approach:

gam2 <- gam(y ~ fac +s(x) +s(x, by = fac, m=1), ...) 

I have been playing around with a third model:

gam3 <- gam(y ~ s(x, by = fac), ...) 

My main questions are: are some of these models just wrong, or are they simply different? In the latter case, what are their differences? Based on the example that I am going to discuss below I think I could understand some of their differences, but I am still missing something.

As an example I am going to use a dataset with color spectra for flowers of two different plant species measured at different locations.

rm(list=ls()) # install.packages("RCurl") library(RCurl) # allows accessing data from URL df <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt")) library(mgcv) 

For clarity, each line in the figure above represents the mean color spectrum predicted for each location with a separate GAM of form density~s(wl) based on samples of ~10 flowers. The gray areas represent 95% CI for each GAM.

My final goal is to model the (potentially interactive) effect of Taxon and wavelength wl on reflectance (referred to as density in the code and dataset) while accounting for Locality as a random effect in a mixed-effect GAM. For the moment I won't add the mixed effect part to my plate, which is already full enough with trying to understand how to model interactions.

I'll start with the simplest of the three interactive GAMs:

gam.interaction0 <- gam(density ~ s(wl, by = Taxon), data = df) # common intercept, different slopes plot(gam.interaction0, pages=1) 

enter image description here

summary(gam.interaction0) 

Produces:

Family: gaussian Link function: identity Formula: density ~ s(wl, by = Taxon) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 28.3490 0.1693 167.4 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl):TaxonSpeciesA 8.938 8.999 884.3 <2e-16 *** s(wl):TaxonSpeciesB 8.838 8.992 325.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.523 Deviance explained = 52.4% GCV = 284.96 Scale est. = 284.42 n = 9918 

The parametric part is the same for both species, but different splines are fitted for each species. It is a bit confusing to have a parametric part in the summary of GAMs, which are non-parametric. @IsabellaGhement explains:

If you look at the plots of the estimated smooth effects (smooths) corresponding to your first model, you will notice that they are centered about zero. So you need to 'shift' those smooths up (if the estimated intercept is positive) or down (if the estimated intercept is negative) to obtain the smooth functions you thought you were estimating. In other words, you need to add the estimated intercept to the smooths to get at what you really want. For your first model, the 'shift' is assumed to be the same for both smooths.

Moving on:

gam.interaction1 <- gam(density ~ Taxon +s(wl, by = Taxon, m=1), data = df) plot(gam.interaction1,pages=1) 

enter image description here

summary(gam.interaction1) 

Gives:

Family: gaussian Link function: identity Formula: density ~ Taxon + s(wl, by = Taxon, m = 1) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 40.3132 0.1482 272.0 <2e-16 *** TaxonSpeciesB -26.0221 0.2186 -119.1 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl):TaxonSpeciesA 7.978 8 2390 <2e-16 *** s(wl):TaxonSpeciesB 7.965 8 879 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.803 Deviance explained = 80.3% GCV = 117.89 Scale est. = 117.68 n = 9918 

Now, each species also have its own parametric estimate.

The next model is the one that I have trouble understanding:

gam.interaction2 <- gam(density ~ Taxon + s(wl) + s(wl, by = Taxon, m=1), data = df) plot(gam.interaction2, pages=1) 

enter image description here

I have no clear idea of what these graphs represent.

summary(gam.interaction2) 

Gives:

Family: gaussian Link function: identity Formula: density ~ Taxon + s(wl) + s(wl, by = Taxon, m = 1) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 40.3132 0.1463 275.6 <2e-16 *** TaxonSpeciesB -26.0221 0.2157 -120.6 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl) 8.940 8.994 30.06 <2e-16 *** s(wl):TaxonSpeciesA 8.001 8.000 11.61 <2e-16 *** s(wl):TaxonSpeciesB 8.001 8.000 19.59 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.808 Deviance explained = 80.8% GCV = 114.96 Scale est. = 114.65 n = 9918 

The parametric part of gam.interaction2 is about the same as for gam.interaction1, but now there are three estimates for smooth terms, which I cannot interpret.

Thanks in advance to anyone who will take the time to help me understand the differences in the three models.

Marco Plebani
  • 1.3k
  • 1
  • 9
  • 26