Differences between approaches to exponential regression

Question

One could fit an exponential in many different ways. This post suggests doing the down-and-dirty lm on the log of the response variable. This SO post suggests using nls which requires a starting estimate. This SO post suggests glm with a gamma/log link function. Here, the illustrious @Glen-b explains some potential differences between approaches.

What are the pros/cons and domains of applicability for these different approaches? Do these methods differ in how well or in what way they calculate confidence intervals?

Like all the other data scientists at home right now, I'm messing around with Covid 19 data.

One thing in particular I noticed is that I can do lm with log, log10, log2 etc., but would have to convert from natural log with glm.
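The base of the logarithm only rescales the coefficients, so converting between bases is a one-line division. A minimal sketch, assuming the World column of the question's table re-entered as plain vectors (the names `fit_ln` and `fit_log2` are only illustrative):

```r
# World counts and day index, copied from the question's last_14 table
days  <- 0:13
World <- c(3460, 3558, 3802, 3988, 4262, 4615, 4720,
           5404, 5819, 6440, 7126, 7905, 8733, 9867)

fit_ln   <- lm(log(World)  ~ days)   # natural log
fit_log2 <- lm(log2(World) ~ days)   # base 2

# log2(y) = log(y) / log(2), so every coefficient is rescaled by 1 / log(2)
coef(fit_ln) / log(2)
coef(fit_log2)

# The doubling time in days follows from either slope
log(2) / coef(fit_ln)[["days"]]
1 / coef(fit_log2)[["days"]]
```

The same division should apply to the coefficients of a log-link glm, which are reported on the natural-log scale.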

    last_14 = data.frame(rbind(
      c(3460, 14, 0),
      c(3558, 17, 1),
      c(3802, 21, 2),
      c(3988, 22, 3),
      c(4262, 28, 4),
      c(4615, 36, 5),
      c(4720, 40, 6),
      c(5404, 47, 7),
      c(5819, 54, 8),
      c(6440, 63, 9),
      c(7126, 85, 10),
      c(7905, 108, 11),
      c(8733, 118, 12),
      c(9867, 200, 13)))
    names(last_14) = c('World', 'US', 'days')

    lm(log(World) ~ days, last_14)
    #>
    #> Call:
    #> lm(formula = log(World) ~ days, data = last_14)
    #>
    #> Coefficients:
    #> (Intercept)         days
    #>     8.06128      0.08142

    glm(formula = World ~ days, data=last_14, family=gaussian(link='log'))
    #>
    #> Call:  glm(formula = World ~ days, family = gaussian(link = "log"),
    #>     data = last_14)
    #>
    #> Coefficients:
    #> (Intercept)         days
    #>     8.00911      0.08819
    #>
    #> Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
    #> Null Deviance:       54450000
    #> Residual Deviance: 816200    AIC: 199.4

    nls(World ~ exp(a + b*days), last_14, start=list(a=5, b=0.03))
    #> Nonlinear regression model
    #>   model: World ~ exp(a + b * days)
    #>    data: last_14
    #>       a       b
    #> 8.00911 0.08819
    #>  residual sum-of-squares: 816246
    #>
    #> Number of iterations to convergence: 8
    #> Achieved convergence tolerance: 1.25e-06

Created on 2020-03-20 by the reprex package (v0.3.0)

I am not sure about the answer to the question (it is very broad). But regarding your little problem with nls, you can try using the formula $a \cdot \exp(b \cdot \text{days})$ instead of $\exp(b \cdot \text{days})$. For instance, this code will work: nls(World ~ a*exp(b*days), last_14, start=list(a=100000, b=0.3))
– Sextus Empiricus, Commented Mar 20, 2020 at 21:16

Demetri Pananos · Accepted Answer · 2020-03-21 16:32:12Z

One of the differences is the likelihood assumed by each model. In case readers can't remember, the likelihood encapsulates assumptions about the conditional distribution of the data. In the case of COVID-19, this would be the distribution of infections (or reported new cases, or deaths, etc.) on a given day. Whatever we want the outcome to be, let's call it $y$. The conditional distribution (e.g. of the number of new cases today) is then that of $y \vert t$ (think of this as $y$ conditioned on the time $t$).

In the case of taking the log and then performing lm, this would mean that $\log(y)\vert t \sim \mathcal{N}(\mu(t), \sigma^2)$ with $\mu(t) = \beta_0 + \beta_1 t$. Equivalently, $y$ is lognormal given $t$. The reason we do linear regression on $\log(y)$ is that on the log scale the conditional mean is independent of the variance, whereas the mean of the lognormal is also a function of the variance. So Pro: we know how to do linear regression. Con: this approach makes the linear regression assumptions on the log scale, which can always be assessed but might be hard to justify theoretically. Another con is that people do not realize that predicting on the log scale and then taking the exponential biases predictions of the mean by a factor of $\exp(\sigma^2/2)$, if I recall correctly: $\exp(\mu(t))$ is the conditional median of $y$, while the conditional mean is $\exp(\mu(t) + \sigma^2/2)$. So when you make predictions from a lognormal model, you need to account for this.
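To make the retransformation point concrete, here is a small simulation sketch. The data are simulated (not the COVID counts), the parameter values are arbitrary assumptions, and the correction shown is the plain lognormal factor $\exp(\hat\sigma^2/2)$ rather than anything more refined:

```r
set.seed(1)

# Simulated lognormal data: log(y) = 1 + 0.3 * day + N(0, 0.5^2) noise
day   <- rep(0:13, each = 50)
sigma <- 0.5
y     <- exp(1 + 0.3 * day + rnorm(length(day), sd = sigma))

fit        <- lm(log(y) ~ day)
sigma2_hat <- summary(fit)$sigma^2   # residual variance on the log scale

naive     <- exp(predict(fit))                   # targets the conditional median of y
corrected <- exp(predict(fit) + sigma2_hat / 2)  # targets the conditional mean of y

# Compare both against the empirical mean of y on each day
cbind(naive     = tapply(naive, day, mean),
      corrected = tapply(corrected, day, mean),
      empirical = tapply(y, day, mean))
```

If the lognormal assumption holds, the corrected column should track the empirical means more closely than the naive one.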

So far as I understand, nls assumes a Gaussian likelihood as well, so in this model $y \vert t \sim \mathcal{N}(\exp(\beta_0 + \beta_1 t), \sigma^2)$. Except now we let the conditional mean of the outcome be nonlinear. This can be a pain because the confidence intervals are not bounded below by 0, so your model might estimate a negative count of infections. Obviously, that can't happen. When the count of infections (or whatever) is larger, a Gaussian can be justifiable. But when things are just starting, this probably isn't the best likelihood. Furthermore, if you fit your data using nls, you'll see that it fits later data very well but not early data. That is because, with a constant variance on the original scale, misfitting the later (larger) data incurs a larger squared-error loss, and the goal of nls is to minimize this loss.

The approach with glm frees us a little and allows us to control the conditional distribution as well as the form of the conditional mean through the link function. In this model, $y \vert t \sim \text{Gamma}(\mu(t), \phi)$ with $\mu(t) = g^{-1}(\beta_0 + \beta_1 t)$. We call $g$ the link, and for the case of the log link $\mu(t) = \exp(\beta_0 + \beta_1 t)$. Pro: these models are much more expressive, but I think the power comes from the ability to perform inference with a likelihood which is not normal. This lifts a lot of the restrictions, for example symmetric confidence intervals. Con: you need a little more theory to understand what is going on.
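As a rough illustration of the glm route and its intervals, the sketch below fits a Gamma model with log link to the World column from the question. The Wald interval and the 1.96 multiplier are my assumptions for the example; profile-likelihood intervals would be another reasonable choice:

```r
# World counts and day index, copied from the question's last_14 table
days  <- 0:13
World <- c(3460, 3558, 3802, 3988, 4262, 4615, 4720,
           5404, 5819, 6440, 7126, 7905, 8733, 9867)

fit_gamma <- glm(World ~ days, family = Gamma(link = "log"))
coef(fit_gamma)                      # coefficients on the link (natural-log) scale

# Wald intervals on the link scale, mapped to the data scale:
# exp(intercept) is the level at day 0, exp(slope) the daily growth factor
exp(confint.default(fit_gamma))

# Pointwise interval for the mean: build it on the link scale, then back-transform
new <- data.frame(days = 14:16)
p   <- predict(fit_gamma, newdata = new, type = "link", se.fit = TRUE)
ci  <- data.frame(days = new$days,
                  fit  = exp(p$fit),
                  lwr  = exp(p$fit - 1.96 * p$se.fit),
                  upr  = exp(p$fit + 1.96 * p$se.fit))
ci
```

Because the interval is constructed on the log scale, it stays positive and is asymmetric around the fitted mean once back-transformed.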

Great job! If you (or anyone else) have more details to add, I think many would benefit.
– abalter, Commented Mar 20, 2020 at 21:34
Hey, why is glm giving me different regression coefficients than lm with log?
– abalter, Commented Mar 20, 2020 at 21:57
@abalter Aha! You've fallen into the trap of the con I've listed! The coefficients are on the scale of the link, not on the scale of the data. You're going to have to apply the inverse link to the coefficients to get their effect on the scale of the data. See log odds ratios in logistic regression, for example.
– Demetri Pananos, Commented Mar 20, 2020 at 21:59
@abalter you have been using glm with a gamma distribution of the errors. When you use Normal distributed errors (with log link) then you get the same as using lm with log.
– Sextus Empiricus, Commented Mar 20, 2020 at 22:02
@SextusEmpiricus -- that's fascinating. They come out so different! And, of course, this is actually count data, so far without replacement (no reinfections as yet). That should provide additional insight into the proper model.
– abalter, Commented Mar 20, 2020 at 22:04

Sextus Empiricus · Accepted Answer · 2020-05-13 11:29:32Z

A known difference between fitting an exponential curve with a nonlinear fit and with a linearized fit is how much weight the errors/residuals of the different points receive.

You can notice this in the plot below.

In that plot you can see that

- the linearized fit (the broken line) fits the points with small values more precisely (see the plot on the right, where the broken line is closer to the values at the beginning);
- the nonlinear fit is closer to the points with high values.

    # US and days are the vectors defined in the data block further down
    modnls <- nls(US ~ a*exp(b*days), start=list(a=100, b=0.3))
    modlm <- lm(log(US) ~ days)

    # first plot: linear scale
    plot(days, US, ylim = c(1,15000))
    lines(days, predict(modnls))
    lines(days, exp(predict(modlm)), lty=2)
    title("linear scale", cex.main=1)
    legend(0, 15000, c("lm","nls"), lty=c(2,1))

    # second plot: log scale
    plot(days, US, log = "y", ylim = c(100,15000))
    lines(days, predict(modnls))
    lines(days, exp(predict(modlm)), lty=2)
    title("log scale", cex.main=1)

Getting the random noise modeled correctly is not always right in practice

In practice the problem is not so often what sort of model to use for the random noise (whether it should be some sort of glm or not).

The problem is much more that the exponential model (the deterministic part) is not correct, and the choice between fitting a linearized model or not is a choice of how strongly to weight the first points versus the last points. The linearized model fits the small values very well, and the nonlinear model fits the large values better.

You can see the incorrectness of the exponential model when we plot the ratio of increase.

When we plot the ratio of the increase for the World variable as a function of time, you can see that it is not constant (and for this period it appears to be increasing). You can make the same plot for the US, but it is very noisy. That is because the numbers are still small, and differentiating a noisy curve makes the noise-to-signal ratio larger.
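The answer does not say exactly how the ratio of increase was computed. One plausible way to reproduce such a plot, assuming it is the day-to-day log growth rate `diff(log(World))`, is sketched below with the World series from this answer:

```r
days  <- 0:13
World <- c(101784, 105821, 109795, 113561, 118592, 125865, 128343,
           145193, 156094, 167446, 181527, 197142, 214910, 242708)

# Log growth rate between consecutive days.
# For a pure exponential y = a * exp(b * t) this equals b at every step.
growth <- diff(log(World))

plot(days[-1], growth, type = "b",
     xlab = "days", ylab = "diff(log(World))")

# Single growth rate implied by the log-linear fit, for reference
abline(h = coef(lm(log(World) ~ days))[["days"]], lty = 2)
```

If the exponential model were correct, the points would scatter around the horizontal line without a trend.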

(also note that the error terms will be incremental and if you really wish to do it right then you should use some arima type of model for the error, or use some other way to make the error terms correlated)

I still don't get why lm with log gives me completely different coefficients. How do I convert between the two?

The glm (with a Gaussian family and log link) and nls both model the errors as $$y - y_{model} \sim N(0,\sigma^2)$$ The linearized model models the errors as $$\log(y) - \log(y_{model}) \sim N(0,\sigma^2)$$ but when you take the logarithm of values, you change their relative size. The differences between 1000.1 and 1000 and between 1.1 and 1 are both 0.1. But on a log scale they are not the same difference anymore.
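The arithmetic behind that last remark can be checked directly (the variable names are only for illustration):

```r
# Same absolute difference on the original scale ...
1000.1 - 1000   # 0.1, up to floating point
1.1 - 1         # 0.1, up to floating point

# ... but very different on the log scale
d_large <- log(1000.1) - log(1000)   # about 0.0001
d_small <- log(1.1) - log(1)         # about 0.0953
c(d_large, d_small)
```

So a residual of 0.1 at a value near 1 counts roughly a thousand times more in the linearized fit than the same residual at a value near 1000.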

This is actually how the glm does the fitting. It uses a linear model, but with transformed weights for the errors (and it iterates this a few times). See the following two, which return the same result:

    last_14 <- list(days  = 0:13,
                    World = c(101784,105821,109795,113561,118592,125865,128343,
                              145193,156094,167446,181527,197142,214910,242708),
                    US    = c(262,402,518,583,959,1281,1663,
                              2179,2727,3499,4632,6421,7783,13677))
    days  <- last_14$days
    US    <- last_14$US
    World <- last_14$World

    # start from the linearized (log-scale) least-squares fit
    Y <- log(US)
    X <- cbind(rep(1,14), days)
    coef <- lm.fit(x=X, y=Y)$coefficients
    yp <- exp(X %*% coef)

    for (i in 1:100) {
      # iterating with different weights
      w <- as.numeric(yp^2)
      # working response: eta + (y - mu)/mu, with eta = log(yp)
      Y <- log(yp) + (US-yp)/yp
      # solve weighted linear equation
      coef <- solve(crossprod(X,w*X), crossprod(X,w*Y))
      # If I use lm.wfit then for some reason I get something different from the direct matrix solution
      # lm.wfit(x=X, y=Y, w=w)$coefficients
      yp <- exp(X %*% coef)
    }
    coef
    # > coef
    #           [,1]
    #      5.2028935
    # days 0.3267964

    glm(US ~ days,
        family = gaussian(link = "log"),
        control = list(epsilon = 10^-20, maxit = 100))
    # > glm(US ~days,
    # +     family = gaussian(link = "log"),
    # +     control = list(epsilon = 10^-20, maxit = 100))
    #
    # Call:  glm(formula = US ~ days, family = gaussian(link = "log"), control = list(epsilon = 10^-20,
    #     maxit = 100))
    #
    # Coefficients:
    # (Intercept)         days
    #      5.2029       0.3268
    #
    # Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
    # Null Deviance:       185900000
    # Residual Deviance: 3533000    AIC: 219.9

Conditional mean for the log linear model is known to be off by a factor proportional to exp(RMSE). See here. Would you mind correcting the predictions for that model and replacing the plots?
– Demetri Pananos, Commented Mar 20, 2020 at 21:56
@DemetriPananos I do not understand what you mean. I just plotted the results from two different least squares fits, the one linearized and the other not.
– Sextus Empiricus, Commented Mar 20, 2020 at 22:00
exp(predict(modlm)) is not the conditional mean on the natural scale for this model. As is argued in that blog post, you need to multiply by a factor that looks like exp(RMSE^2/2).
– Demetri Pananos, Commented Mar 20, 2020 at 22:03
@DemetriPananos but lm is predicting the conditional mean on the log scale. I am only transforming this prediction on the log scale. I am not sure what the scaling with that factor is supposed to do.
– Sextus Empiricus, Commented Mar 20, 2020 at 22:11
Are you not transforming the predictions to the original scale by doing exp(predict(modlm))? The relevant part of the link I've provided is "We, however, have no real interest in E(ln(yj)). We fit this log regression as a way of obtaining estimates of our real model, namely yj = exp(b0 + Xjb + εj) So rather than taking the expectation of ln(yj), lets take the expectation of yj. "
– Demetri Pananos, Commented Mar 20, 2020 at 22:13

David Schneider · Accepted Answer · 2023-01-20 22:27:51Z

For a comparison of exponential models fitted in competing ways see:

Best Fit for Exponential Data

This shows a comparison in a case where exponential change was chosen in advance as appropriate to the question (the exponential increase in seal numbers after the 1972 Marine Mammal Protection Act). The comparison shows the expected difference between log(y) and y as response variables, as described above.
