Is there a statistical analysis that can be used in R to assess significant differences between regions of a curve?

Question

I would like to know if there is a statistical analysis to determine differences between regions of a curve/regression across two treatments. I know there are several statistical analyses for determining if there are differences between two regressions/curves by a variable overall, but I don't know of any that can identify differences by region.

Let's say that X = Light_uE, Y = CellspermL, and Species is the treatment condition (which should be used to generate two different plots). How do I tell if there is a significant difference between, say, the curve at the Light_uE (X) conditions from 10-20 or 20-30 between Species? Again, not if the curves are different overall. This question seems similar to what I am seeking, only it examines the entire dataset rather than just a small segment, and uses a polynomial function rather than a piecewise regression. (I also don't think an ANOVA would be appropriate for this, and I would like to stick with ggplot when possible.)

I have provided a reproducible dataframe, as well as code to plot this data if you want to visualize it. Thanks for any help.

example_curves <- data.frame( Light_uE = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), CellspermL = c(1000000, 2000000, 3000000, 4000000, 5000000, 4000000, 3000000, 2000000, 1000000, 500000, 2000000, 2500000, 3500000, 3600000, 4000000, 3900000, 3000000, 2000000, 1000000, 500000), Species = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B")) print(example_curves) # Piecewise plotting library(segmented) library(gridExtra) library(ggplot2) create_piecewise_plot <- function(data, title, initial_psi) { # Fit a linear model lm_model <- lm(CellspermL ~ Light_uE, data = data) # Apply segmented function with manually specified initial breakpoint seg_model <- segmented( lm_model, seg.Z = ~Light_uE, psi = initial_psi, # Manually specify the starting point control = seg.control(display = FALSE) # Disable extra output ) data$fitted_values <- fitted(seg_model) p <- ggplot(data, aes(x = Light_uE, y = CellspermL)) + geom_point() + geom_line(aes(y = fitted_values), color = 'blue') + labs( title = title, x = "Light (uE)", y = "Cells per mL" ) + theme_minimal() return(p) } # Creation of two different plots by Species data_species_a <- subset(example_curves, Species == "A") data_species_b <- subset(example_curves, Species == "B") range_a <- range(data_species_a$Light_uE) range_b <- range(data_species_b$Light_uE) # Ensure initial_psi values are within the range plot_species_a <- create_piecewise_plot(data_species_a, "Species A", initial_psi = 50) # Choose a valid value within range_a plot_species_b <- create_piecewise_plot(data_species_b, "Species B", initial_psi = 50) # Choose a valid value within range_b # Arrange the plots horizontally using grid.arrange grid.arrange(plot_species_a, plot_species_b, ncol = 2)

Can you elaborate on your actual research question? It's hard to tell from what's here if "is there a significant difference between, say, the curve at the Light_uE (X) conditions from 10-20 or 20-30 between Species" is actually your question, or if this is your proposed statistical measure to answer a different question. In the latter case a different method might be easier and more appropriate. — Zane Billings
– Zane Billings, Commented Jan 14 at 17:12
@wzbillings Essentially, I am trying to determine what effect, if any, species has on determining the number of cells at different light levels. To be more specific, I would like to determine if some function of species interacts with light differently across a light gradient, and thus is there a difference between number of cells across species within a given light range. A highly pointed example: "Does species B tolerate light better at lower levels (10-20), represented by population density (cells per mL), relative to species A?" — Zard22
– Zard22, Commented Jan 14 at 17:44

EdM · Accepted Answer · 2025-01-15 18:12:02Z

One approach would be to fit an appropriate model of CellspermL versus Light_uE, including an interaction with Species to allow for curves that differ between the Species, and then interrogate the model for Species differences at selected levels of Light_uE.

In this biological context, there's no reason to expect a segmented linear association between CellspermL and Light_uE. A smooth cubic regression spline can let the data show the form of the association.

Here's an example, using tools from Frank Harrell's rms package in R. It starts with your example_curves data frame.

library(rms) ## the next 2 lines provide information to the package's post-modeling tools ddec <- datadist(example_curves) options(datadist="ddec") ## use a restricted cubic spline with 3 knots for the smooth curve ## interact with Species to let the curves differ ols1 <- ols(CellspermL ~ rcs(Light_uE, 3)*Species, data = example_curves) ## examine modeled differences between Species A and B ## the following gives estimates for Species="A" minus Species="B" ## at Light_uE values of 15 and 25 contrast(ols1, a=list(Light_uE=c(15,25), Species="A"), b=list(Light_uE=c(15,25), Species="B")) # Light_uE Contrast S.E. Lower Upper t Pr(>|t|) # 1 15 -747336.7 287810.9 -1364629.7 -130043.58 -2.60 0.0211 # 2 25 -395216.5 209702.7 -844984.1 54551.06 -1.88 0.0804

The anova() function from the rms package applied to the ols1 model provides useful information about overall significance and that of the nonlinear and interaction terms (not shown).

Make sure to have pre-specified levels of Light_uE at which to compare the Species. If you look at the results to help choose the levels of Light_uE, you can severely bias the results.

I'd recommend that you do the analysis as close to the original observed data as possible. Your estimates of cells per milliliter presumably are based on smaller numbers of counts that were then adjusted for the sampling volume. If all sampling volumes are the same, work with the number of counts. If not, use an offset term to account for differences in sampling volumes while working with the observed counts as the outcome. If the observed counts are low, consider working with a Poisson or negative-binomial model appropriate to counts.

In response to comments

The above provides estimates at values of Light_uE values of 15 and 25, chosen to be at midpoints of two ranges of interest. It evaluates the differences in CellspermL levels between the two Species at those particular Light_uE values.

Once you have a regression model, you can evaluate estimated outcome differences among any linear combinations of predictor values, by applying the formula for the variance of a weighted sum of random variables to get the standard errors. The (co)variances of the parameter estimates are the model's variance-covariance matrix; the weights are the (differences among) the hypothesized predictor values.

That's what the contrast() function above does. If you use R functions outside the rms package, the emmeans package has tools that can handle a wide range of model types.

As an example continuing with the rms package, the following double-difference contrast evaluates the difference in linear slope of Light_uE values from 10 and 20, between the two Species (slope for "A" minus slope for "B"):

contrast(ols1, a=list(Light_uE=20,Species="A"), b=list(Light_uE=10,Species="A"), a2=list(Light_uE=20,Species="B"), b2=list(Light_uE=10,Species="B")) # Contrast S.E. Lower Upper t Pr(>|t|) # 11 352120.1 120417.6 93850.07 610390.2 2.92 0.0111 # # Error d.f.= 14 # # Confidence intervals are 0.95 individual intervals

That uses CellspermL values estimated from the restricted cubic spline fit. If you instead use a segmented linear spline as in your question, you could proceed similarly.

With some effort you could even go to more complex estimates of the "shape" of the fit. Calling Function(ols1) shows the formula describing the spline. Section 2.4.5 of Frank Harrell's Regression Modelin Strategies shows how the spline coefficients of the model are represented in that formula; there's some scaling involved "[f]or numerical behavior and to put all basis functions ... on the same scale." In principle you could use that information to evaluate differences in slope at various Light_uE values, taking derivatives with respect to Light_uE as functions of the spline coefficients and applying the formula for the variance of a weighted sum of random variables to get standard errors.

I think a spline is the right approach here. Especially since you can also get an "omnibus test" using an LRT to compare a spline model with the interaction term to one without an interaction term, which gives you a single measure of whether the relationships are different between the species, editors in biomedical science tend to like that kind of thing. — Zane Billings
– Zane Billings, Commented Jan 14 at 19:12
@EdM This is pretty helpful as I'm thinking about this, thank you. From my understanding--correct me if I am wrong--the example "contrast" only compares differences AT 15 across conditions, and AT 25 across conditions, correct? I would like to compare differences within a larger range. For example, "is the shape of the curve significantly different from 10 through 30?" When I plot the data (with my actual dataset, which is sensitive--this is a mock one) it's clear there are sharp and distinct visual differences between the two, that this current code does not indicate. — Zard22
– Zard22, Commented Jan 14 at 22:27
( @EdM I understand the concerns about biasing the dataset, but we preselected our Light_uE ranges beforehand, and the categories these represent.) — Zard22
– Zard22, Commented Jan 14 at 22:28
My original answer was for differences in level of CellspermL between the Species at any given value of Light_uE. I've added a simple way to evaluate differences in the linear slope of CellspermL versus Light_uE between any two Light_uE values, and some thoughts on how you might extend the approach to differences in smoothed slopes. — EdM
– EdM, Commented Jan 15 at 18:14

jarbet · Accepted Answer · 2025-01-15 23:19:40Z

You should be able to do this with generalized additive models (GAM). Here is an example:

suppressPackageStartupMessages(library(mgcv)); suppressPackageStartupMessages(library(modelbased)); #> Warning: package 'modelbased' was built under R version 4.3.3 mtcars$am <- factor(mtcars$am, levels = c(0,1), labels = c('automatic', 'manual')) # There are 2 ways for fitting a continuous by categorical interaction in gam: fit.int1 <- gam(mpg ~ s(hp, am, bs = 'fs'), data = mtcars, method = 'REML'); fit.int2 <- gam(mpg ~ s(hp, by = am) + am, data = mtcars, method = 'REML'); # fit with no interaction fit.noint <- gam(mpg ~ s(hp) + am, data = mtcars, method = 'REML'); # compare the models, smaller AIC is better AIC(fit.int1, fit.int2, fit.noint) #> df AIC #> fit.int1 7.871093 163.7976 #> fit.int2 6.820222 162.5192 #> fit.noint 6.254584 160.9220

Although we don't have evidence that the interaction improves the fit (since fit.noint has the best AIC), nevertheless, here is how you could visualize the interaction.

Regions where the confidence bands don't overlap suggest differences in the curves between the 2 groups:

plot(estimate_relation(fit.int1));

Overall, this approach is a bit informal. You use AIC or BIC to determine whether there is evidence for the curves differing between the groups. If yes, you can visualize it using modelbased::estimate_relation and areas where the confidence bands do not overlap suggest a difference in the curve between the groups. I'm guessing there is a more formal way to do this with mgcv::gam, but I'm not sure how.

Stack Exchange Network

Is there a statistical analysis that can be used in R to assess significant differences between regions of a curve?

2 Answers 2

Linked

Hot Network Questions

Is there a statistical analysis that can be used in R to assess significant differences between regions of a curve?

2 Answers 2

Linked

Related

Hot Network Questions