Demetri Pananos

One of the differences is the likelihood for each model. In case readers can't remember, the likelihood encapsulates assumptions about the conditional distribution of the data. In the case of COVID-19, this would be the distribution of infections (or reported new cases, or deaths, etc.) on a given day. Whatever we want the outcome to be, let's call it $y$. Thus, the conditional distribution (e.g. of the number of new cases today) would be $y\vert t$ (think of this as $y$ conditioned on $t$).

  • In the case of taking the log and then performing lm, this would mean that $\log(y)\vert t \sim \mathcal{N}(\mu(t), \sigma^2)$. Equivalently, $y$ is lognormal given $t$. The reason we do linear regression on $\log(y)$ is that on the log scale the conditional mean is independent of the variance, whereas the mean of the lognormal is also a function of the variance. Pro: we know how to do linear regression. Con: this approach makes the linear regression assumptions on the log scale, which can always be assessed but might be hard to justify theoretically. Another con is that people often do not realize that predicting on the log scale and then taking the exponential actually biases predictions by a factor of $\exp(\sigma^2/2)$, if I recall correctly. So when you make predictions from a lognormal model, you need to account for this bias.

  • So far as I understand, nls assumes a Gaussian likelihood as well, so in this model $y \vert t \sim \mathcal{N}(\exp(\beta_0 + \beta_1 t), \sigma^2)$. Except now, we let the conditional mean of the outcome be non-linear in the parameters. This can be a pain because the confidence intervals are not bounded below by 0, so your model might estimate a negative count of infections. Obviously, that can't happen. When the count of infections (or whatever) is large, a Gaussian can be justifiable, but when things are just starting out, this probably isn't the best likelihood. Furthermore, if you fit your data using nls, you'll see that it fits later data very well but not early data. That is because misfitting later data incurs a larger loss, and the goal of nls is to minimize this loss.

  • The approach with glm frees us up a little and allows us to control the conditional distribution as well as the form of the conditional mean through the link function. In this model, $y \vert t \sim \text{Gamma}(\mu(t), \phi)$ with $\mu(t) = g^{-1}(\beta_0 + \beta_1 t)$. We call $g$ the link function, and for a log link $\mu(t) = \exp(\beta_0 + \beta_1 t)$. Pro: these models are much more expressive, though I think the real power comes from the ability to perform inference with a likelihood which is not normal. This lifts a lot of the restrictions, for example symmetric confidence intervals. Con: you need a little more theory to understand what is going on.
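To make the three approaches concrete, here is a minimal R sketch on simulated exponential-growth data. The parameter values ($\beta_0 = 1$, $\beta_1 = 0.15$, $\sigma = 0.3$) and the nls starting values are made up purely for illustration; real case counts would of course not be this clean.

```r
set.seed(1)

# Simulated exponential growth with multiplicative (lognormal) noise.
t <- 1:40
beta0 <- 1; beta1 <- 0.15; sigma <- 0.3
y <- exp(beta0 + beta1 * t + rnorm(length(t), sd = sigma))
d <- data.frame(t = t, y = y)

## 1) lm on the log scale: log(y) | t ~ Normal
fit_lm <- lm(log(y) ~ t, data = d)
s2 <- summary(fit_lm)$sigma^2
# Naively exponentiating predictions is biased low;
# correct by the lognormal factor exp(sigma^2 / 2).
pred_naive     <- exp(predict(fit_lm))
pred_corrected <- pred_naive * exp(s2 / 2)

## 2) nls: Gaussian likelihood, non-linear conditional mean
fit_nls <- nls(y ~ exp(b0 + b1 * t), data = d,
               start = list(b0 = 0, b1 = 0.1))

## 3) glm: Gamma likelihood with a log link
fit_glm <- glm(y ~ t, data = d, family = Gamma(link = "log"))

# All three recover a growth-rate estimate near beta1 = 0.15,
# but they make different assumptions about the noise.
coef(fit_lm); coef(fit_nls); coef(fit_glm)
```

Plotting the three fitted curves against the data is a good way to see the nls behaviour described above: its curve hugs the late, large counts at the expense of the early ones.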
