The following question builds on the discussion found on this page. Given a response variable y, a continuous explanatory variable x and a factor fac, it is possible to define a General Additive Model (GAM) with an interaction between x and fac using the argument by=. According to the help file ?gam.models in the R package mgcv, this can be accomplished as follows:
gam1 <- gam(y ~ fac +s(x, by = fac), ...) @GavinSimpson here suggests a different approach:
gam2 <- gam(y ~ fac +s(x) +s(x, by = fac, m=1), ...) I have been playing around with a third model:
gam3 <- gam(y ~ s(x, by = fac), ...) My main questions are: are some of these models just wrong, or are they simply different? In the latter case, what are their differences? Based on the example that I am going to discuss below I think I could understand some of their differences, but I am still missing something.
As an example I am going to use a dataset with color spectra for flowers of two different plant species measured at different locations.
rm(list=ls()) # install.packages("RCurl") library(RCurl) # allows accessing data from URL df <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt")) library(mgcv) These are the mean color spectra at the locality level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
My final goal is to model the (potentially interactive) effect of Taxon and wavelength wl on % reflectance (referred to as density in the code and dataset) while accounting for Locality as a random effect in a mixed-effect GAM. For the moment I won't add the mixed effect part to my plate, which is already full enough with trying to understand how to model interactions.
I'll start with the simplest of the three interactive GAMs:
gam.interaction0 <- gam(density ~ s(wl, by = Taxon), data = df) # common intercept, different slopes plot(gam.interaction0, pages=1) summary(gam.interaction0) Produces:
Family: gaussian Link function: identity Formula: density ~ s(wl, by = Taxon) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 28.3490 0.1693 167.4 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl):TaxonSpeciesA 8.938 8.999 884.3 <2e-16 *** s(wl):TaxonSpeciesB 8.838 8.992 325.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.523 Deviance explained = 52.4% GCV = 284.96 Scale est. = 284.42 n = 9918 The parametric part is the same for both species, but different splines are fitted for each species. It is a bit confusing to have a parametric part in the summary of GAMs, which are non-parametric. @IsabellaGhement explains:
If you look at the plots of the estimated smooth effects (smooths) corresponding to your first model, you will notice that they are centered about zero. So you need to 'shift' those smooths up (if the estimated intercept is positive) or down (if the estimated intercept is negative) to obtain the smooth functions you thought you were estimating. In other words, you need to add the estimated intercept to the smooths to get at what you really want. For your first model, the 'shift' is assumed to be the same for both smooths.
Moving on:
gam.interaction1 <- gam(density ~ Taxon +s(wl, by = Taxon, m=1), data = df) plot(gam.interaction1,pages=1) summary(gam.interaction1) Gives:
Family: gaussian Link function: identity Formula: density ~ Taxon + s(wl, by = Taxon, m = 1) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 40.3132 0.1482 272.0 <2e-16 *** TaxonSpeciesB -26.0221 0.2186 -119.1 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl):TaxonSpeciesA 7.978 8 2390 <2e-16 *** s(wl):TaxonSpeciesB 7.965 8 879 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.803 Deviance explained = 80.3% GCV = 117.89 Scale est. = 117.68 n = 9918 Now, each species also have its own parametric estimate.
The next model is the one that I have trouble understanding:
gam.interaction2 <- gam(density ~ Taxon + s(wl) + s(wl, by = Taxon, m=1), data = df) plot(gam.interaction2, pages=1) I have no clear idea of what these graphs represent.
summary(gam.interaction2) Gives:
Family: gaussian Link function: identity Formula: density ~ Taxon + s(wl) + s(wl, by = Taxon, m = 1) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 40.3132 0.1463 275.6 <2e-16 *** TaxonSpeciesB -26.0221 0.2157 -120.6 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(wl) 8.940 8.994 30.06 <2e-16 *** s(wl):TaxonSpeciesA 8.001 8.000 11.61 <2e-16 *** s(wl):TaxonSpeciesB 8.001 8.000 19.59 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.808 Deviance explained = 80.8% GCV = 114.96 Scale est. = 114.65 n = 9918 The parametric part of gam.interaction2 is about the same as for gam.interaction1, but now there are three estimates for smooth terms, which I cannot interpret.
Thanks in advance to anyone who will take the time to help me understand the differences in the three models.





gam1plus something for theSampleIDeffect plus you need to do something about the non-constant variance problem; These data don't seem to be conditionally distributed Gaussian because of the lower bound. $\endgroup$densityis negative at small wavelengths for some observations. I also noted that you mention the data are % reflectance. Are those negativedensityvalues real? If so, how do they arise? Is it via some normalization? If these are real, that excludes the Tweedie family and the Gamma. If this is a true %, (question remains why are there some negative values) and assuming the negative values can be excluded (or something?), then a beta regressionfamily = betar()might be more appropriate. $\endgroup$