$\begingroup$

I'm trying to determine whether a low R-squared value poses a problem when assessing the coefficients. My population is divided into two groups (A and B), and I want to assess whether there's a significant difference between them with respect to the dependent variable. I've included several control variables in the regression, but the R-squared values are very low (<0.05). Should I be concerned about this, and would it be advisable to add more control variables?

Suppose I wanted to publish the results in an academic journal; would this pose a problem?

$\endgroup$

2 Answers

$\begingroup$

R squared tells you nothing important when it comes to causal inference and the validity of your estimate.

In brief:

  • R squared tells us how much variation in the outcome is explained by variation in the predictors.
  • Covariates with small causal effects in systems with high noise can result in models with low R squared.
  • Despite this, we can still detect causal effects of variables if our identification strategy is valid and we are adequately powered to do so.
  • Adding more variables can reduce the residual variability in the outcome, which leads to higher R squared and better precision, but there is a legitimate risk that, if we are not careful, we add a variable that breaks our identification strategy (e.g. a cause of the treatment, a collider, etc.).

Does a low R squared model pose a problem in assessing the coefficients? No, not necessarily. What a low R squared model tells me is that the system is quite noisy compared with the signal from the covariate. This may pose a problem for statistical power, but that could (in principle) be combated by collecting more data or by adjusting for appropriate covariates to reduce residual variation (see my last point above).

Generally, I would not worry about low R squared models simply because they have low R squared. As an example, a model for a binary exposure in an RCT with a binary outcome will almost surely have low R squared, and yet if our identification strategy is correct, we can use a linear model to estimate treatment effects in such scenarios (and do so quite reliably).
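To make the RCT example concrete, here is a minimal simulation sketch (all numbers are assumed for illustration): a randomized binary treatment that raises a binary outcome's risk from 30% to 35%, analyzed with a linear probability model. The treatment effect is recovered accurately even though R squared is tiny.

```python
# Simulated RCT with binary treatment and binary outcome (illustrative numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.integers(0, 2, n)            # randomized binary treatment
p = 0.30 + 0.05 * t                  # true risk: 30% control, 35% treated
y = rng.binomial(1, p)

# Linear probability model: OLS of y on [1, t]; slope estimates the ATE.
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()

print(f"ATE estimate: {beta[1]:.3f}")   # close to the true 0.05
print(f"R squared:    {r2:.4f}")        # well below 0.01
```

The slope lands near the true 0.05 because randomization guarantees identification; the R squared is negligible because a binary outcome with ~30% base rate is dominated by irreducible Bernoulli noise.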

$\endgroup$
  • $\begingroup$ Sorry, maybe I'm misinterpreting this... Identification is not the same as estimation, and R-square reflects the latter. In your last paragraph, I think your example needs to be caveated a bit with the fact that we still need a reasonably accurate estimate (i.e. small SE) around our treatment effects. If we had a high R-square, accurate estimates would be (more) likely, but a low R-square doesn't preclude them. Going ahead and saying "Well, identification is done, so we are estimating and reporting whatever $\beta$ is there as truthful." can result in misleading conclusions. $\endgroup$ Commented Aug 26, 2024 at 0:32
  • $\begingroup$ @usεr11852 Yes, there is the whole other part of designing the study to be powered/have small standard errors, etc. The point I am making here is that we can have a well-identified and well-estimated causal contrast and still have a low R squared model, and R squared tells us very little about interpreting the estimate as causal. Yes, a high R squared means low residual standard error -- which implies a high-precision estimate -- but as I mention in my post, low precision can be combated in a number of ways (e.g. more data, variance-reduction techniques) and we may still have a low R squared. $\endgroup$ Commented Aug 26, 2024 at 1:20
$\begingroup$

Suppose you have a linear regression model $Y = \beta_0 + \beta_1 T + \beta_2 X + \dotsc + \epsilon$ where $Y$ is the weight of livestock, $T \in \{0,1\}$ represents treatment assignment (group A versus group B, e.g. no hormone versus growth hormone), and $X$ is a control variable such as initial weight.

Objective of causal inference: getting an unbiased, efficient estimate of the marginal effect of a treatment on the outcome, which is typically measured by $\beta_1$, the coefficient of $T$ in a model for $Y$. In particular, we want to find evidence that $\beta_1 \neq 0$. This has little to do with $R^2$. Under a correct experimental design and model specification, it is possible to have a very large $\beta_1$ but a very small $R^2$, which means that many other determinants of $Y$ are not accounted for and remain in the error term. On the other hand, it is also possible to have a very small $\beta_1$ but a very large $R^2$, which means that $Y$ has only a few determinants and they are mostly accounted for by $T$ and $X$. A small $R^2$ invalidates neither the experimental design nor the model specification, as long as $\beta_1$ is unbiased.
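The "large $\beta_1$, small $R^2$" case can be sketched with a short simulation (all numbers assumed for illustration): a true effect of 5 buried in an error term with standard deviation 50.

```python
# A large, precisely estimated beta_1 coexisting with a tiny R^2
# because the error variance dominates (illustrative numbers).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
t = rng.integers(0, 2, n)
y = 5.0 * t + rng.normal(0, 50, n)   # true beta_1 = 5, very noisy outcome

X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
se_b1 = resid.std() * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])

print(f"beta_1 = {beta[1]:.2f}, SE = {se_b1:.2f}, R^2 = {r2:.4f}")
```

With this sample size the $t$ statistic for $\beta_1$ is far above conventional thresholds, yet $R^2$ stays near 0.0025: significance and explained variance are answering different questions.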

Benefits of adding more covariates or control variables: smaller standard errors and possible correction for confounding effects in observational studies. With more predictors, $R^2$ increases while the residual standard error decreases. Because all coefficients' standard errors are proportional to the residual standard error, the standard error of the treatment coefficient will be smaller with additional covariates. Holding the point estimate of the treatment coefficient constant, a smaller standard error corresponds to a larger $t$ or $z$ statistic and a smaller $p$ value. This means that a causal effect could appear nonsignificant when important covariates are missing but significant when they are included. Adding covariates is necessary if they affect both treatment and outcome. For example, $X$ is the initial weight, which of course determines the ending weight $Y$. If hormone use $T$ is more likely among animals with lower $X$, which makes $X$ and $T$ negatively correlated, then with $X$ omitted $\beta_1$ will absorb the effects of both $T$ and the portion of $X$ that correlates with $T$, returning an estimate smaller than the unbiased one.
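The confounding scenario in the livestock example can be sketched as follows (all coefficients and distributions are assumed for illustration): lighter animals are more likely to receive the hormone, so omitting initial weight drags the treatment estimate below its true value.

```python
# Omitted-variable bias: initial weight x confounds hormone use t
# and ending weight y (illustrative numbers).
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(100, 10, n)                            # initial weight (confounder)
t = (x + rng.normal(0, 10, n) < 100).astype(float)    # lighter animals treated more often
y = 200 + 10 * t + 0.5 * x + rng.normal(0, 5, n)      # true treatment effect = 10

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_short = ols(np.column_stack([np.ones(n), t]), y)     # x omitted: biased downward
b_long  = ols(np.column_stack([np.ones(n), t, x]), y)  # x included: unbiased

print(f"beta_1 without x: {b_short[1]:.2f}")   # noticeably below 10
print(f"beta_1 with x:    {b_long[1]:.2f}")    # close to 10
```

The short regression's slope absorbs the negative $x$–$t$ correlation times the effect of $x$ on $y$, exactly the omitted-variable bias formula; adjusting for $x$ restores the unbiased estimate.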

Caveats of adding more covariates: avoiding variables that distort what $\beta_1$ is supposed to measure. Recall that $\beta_1$ should return the marginal effect on $Y$ of $T$ switching from 0 to 1, holding $X$ constant. If $T$ cannot freely switch between 0 and 1 without changes in $X$, then the model is misspecified. This can happen if $X$ is either a cause or an effect of $T$. For example, if $X$ is a screening procedure and only animals passing the screening receive the growth hormone, then $X$ is a cause of $T$; if $X$ is body length, which increases along with weight and can be affected by $T$, then both $X$ and $Y$ are effects of $T$. In the former case, $X$ should be removed from the equation and act as an instrumental variable for $T$. See "instrumental variable estimator." In the latter case, $X$ should be removed from the equation and analyzed separately under a different topic. These decisions are made based on $\beta_1$, not on $R^2$. See What variables to include/exclude when estimating causal relationships using regression.
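The body-length case (an effect of $T$) can also be sketched in a short simulation (numbers assumed for illustration): adjusting for a post-treatment variable blocks part of the treatment's pathway to the outcome and understates the marginal effect.

```python
# Conditioning on a post-treatment variable (body length) distorts beta_1
# (illustrative numbers).
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t = rng.integers(0, 2, n).astype(float)
length = 2.0 * t + rng.normal(0, 1, n)         # length is affected by treatment
y = 10 * t + 3 * length + rng.normal(0, 5, n)  # total effect of t = 10 + 3*2 = 16

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_total   = ols(np.column_stack([np.ones(n), t]), y)          # ~16: the marginal effect
b_blocked = ols(np.column_stack([np.ones(n), t, length]), y)  # ~10: pathway blocked

print(f"beta_1 without length: {b_total[1]:.2f}")
print(f"beta_1 with length:    {b_blocked[1]:.2f}")
```

Leaving `length` out recovers the full marginal effect of the hormone; including it reports only the effect that does not operate through growth in length, which is not the estimand here.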

Key assumption of an unbiased treatment effect: $T$ is independent of the error term $\epsilon$. This usually comes from the experimental design rather than any modeling effort if $T$ is assigned randomly, such as in a randomized, double-blind, controlled trial. However, treatment-effect heterogeneity can be present even under random assignment, in which case functions, polynomials, and interactions of $T$ should be included. If omitted, these terms related to $T$ will be absorbed into $\epsilon$, making $T$ and $\epsilon$ correlated and biasing the effect estimate. When $T$ is binary, interaction terms between $T$ and $X$ may be necessary if the effect of $T$ on $Y$ varies with $X$. For example, if the growth hormone accelerates weight gain among female animals but inhibits growth among males, omitting the interaction between hormone $T$ and sex $X$ may find $\beta_1$ nearly zero and miss the important discovery that the effect is negative among males and positive among females.
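The sex-by-hormone example can be sketched numerically (effect sizes assumed for illustration): an effect of +5 for females and -5 for males averages to roughly zero unless the interaction is modeled.

```python
# Heterogeneous treatment effects hidden by a pooled coefficient
# (illustrative numbers).
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
t = rng.integers(0, 2, n).astype(float)
female = rng.integers(0, 2, n).astype(float)   # sex, independent of treatment
effect = np.where(female == 1, 5.0, -5.0)      # +5 for females, -5 for males
y = 100 + effect * t + rng.normal(0, 5, n)

def ols(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_pooled = ols(np.column_stack([np.ones(n), t, female]), y)
b_inter  = ols(np.column_stack([np.ones(n), t, female, t * female]), y)

print(f"no interaction:   beta_1 = {b_pooled[1]:+.2f}")            # near zero
print(f"with interaction: male effect = {b_inter[1]:+.2f}, "
      f"female effect = {b_inter[1] + b_inter[3]:+.2f}")
```

In the interaction model the coefficient on $T$ is the effect among males, and the coefficient on $T$ plus the interaction coefficient is the effect among females, recovering the opposite-signed effects the pooled model averages away.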

Overall, a good causal inference study requires careful experimental design and correct model specification, which should be assessed by the properties of $\beta_1$, not by the size of $R^2$.

$\endgroup$
