In Introductory Econometrics: A Modern Approach, Wooldridge writes the following regarding omitted variable bias and its effect on the variance of the OLS estimator (here x1 and x2 are correlated):
This makes intuitive sense: by the definition of omitted variable bias, x1 and x2 are correlated, so including x2 in the regression should inflate the variance of the estimate of beta1 due to multicollinearity.
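If I understand the passage correctly, the comparison is the standard textbook one (assuming homoskedastic errors, with $\sigma^2$ the error variance of the model that includes x2):

$$\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\operatorname{SST}_1\,(1 - R_1^2)} \qquad \text{vs.} \qquad \operatorname{Var}(\tilde\beta_1) = \frac{\sigma^2}{\operatorname{SST}_1},$$

where $\hat\beta_1$ comes from the regression that includes x2, $\tilde\beta_1$ from the regression that omits it, $\operatorname{SST}_1$ is the total sample variation in x1, and $R_1^2$ is the R-squared from regressing x1 on x2. Since $R_1^2 > 0$ when x1 and x2 are correlated, this gives $\operatorname{Var}(\tilde\beta_1) < \operatorname{Var}(\hat\beta_1)$.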
However, when I run a simulation in R, I consistently get the exact opposite of what Wooldridge suggests. Consider the following data-generating process:
    x1 <- rnorm(10000)                    # regressor of interest
    x2 <- rnorm(10000) + 0.2 * x1         # second regressor, correlated with x1
    y  <- 0.5 - 2 * x1 - 2.5 * x2 + rnorm(10000)
    summary(lm(y ~ x1 + x2))              # full specification
    summary(lm(y ~ x1))                   # x2 omitted

No matter how many times I run this simulation, the standard error of beta1 in the omitted-variable specification is always larger than in the full (unbiased) specification.
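To check that this is not a fluke of a single draw, here is a rough sketch that repeats the simulation many times and averages the two standard errors (the helper function name, seed, and number of replications are my own choices, not from the book):

    # Run one simulation and return the SE of beta1 under both specifications
    se_beta1 <- function(n = 10000) {
      x1 <- rnorm(n)
      x2 <- rnorm(n) + 0.2 * x1
      y  <- 0.5 - 2 * x1 - 2.5 * x2 + rnorm(n)
      c(full    = summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"],
        omitted = summary(lm(y ~ x1))$coefficients["x1", "Std. Error"])
    }

    set.seed(1)                            # for reproducibility
    rowMeans(replicate(100, se_beta1()))   # average SE across 100 replications
    # On my runs, the "omitted" SE is consistently larger than the "full" SE.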
How is this possible?
