When are OLS linear regression parameters inaccurate?

Question

Q1: Show quantitatively that OLS regression can be applied imappropriately for linear parameters estimation. Edit: As an example, suppose we are plotting blood pressures using Machine A on the x-axis and Machine B on the y-axis. Both machines have error. If we assume that the errors are independent AND homoscedastic AND if the errors and models are related by $y(i)\approx m[x(i)-\epsilon_x(i)]+b+\epsilon_y(i)$ then this is OLS when $\lim\epsilon_x(i)\to 0$ for all $\epsilon_x(i)$. That is, OLS approximates $y(i)\approx m \,x(i)+b+\epsilon_y(i)$ when $\epsilon_y(i)$ is "nice" (e.g., normally distributed) and does not approximate $y(i)\approx m[x(i)-\epsilon_x(i)]+b+\epsilon_y(i)$. This is a very common problem in the medical literature, which is why it is posited here.

OLS in y returns a minimum error regression line for estimating y-values given equidistant x-axis values. When the x-values are not equidistant, the least error in y estimate is generally not the line corresponding to a best functional relationship between x and y, but remains a least error estimator of y given x, which is fine, if what we want as a regression goal is to estimate y given x, but is not good if we want to see, for example, how method A relates to method B for which a regression treatment that accounts for the variance of both methods is needed to establish their functional interrelationship; codependency.

We show an example of how linear OLS does not echo generating slope and intercept in the bivariate case using a Monte Carlo simulation (We are making an example, not a proof here, the question asks for a proof. Note that for low $\text{R}^2$-values the the effect is easy to show for $n$-small, and for higher $\text{R}^2$-values $n$ has to be larger. Here, $\text{R}^2\approx 0.8$. To keep the same $\text{R}^2$ value, among other possibilities, we could keep the same $X$-axis range while we increase $n$. For example, for $n=10,000$ rather than $n=1,000$, we could make $\Delta X1=0.001$).

Code:EXCEL 2007 or higher

 A B C D E F 1 X1 RAND1 RAND2 Y1=NORM.INV(RAND1,X1,SQRT(2)) Y2=NORM.INV(RAND1,X1,1) X2=NORM.INV(RAND2,X,1) 2 0 =RAND() =RAND() =NORM.INV(B2,A2,SQRT(2)) =NORM.INV(B2,A2,1) =NORM.INV(C2,A2,1) 3 =A1+0.01 =RAND() =RAND() =NORM.INV(B3,A3,SQRT(2)) =NORM.INV(B3,A3,1) =NORM.INV(C3,A3,1) 4 =A2+0.01 . . . . . 5 =A3+0.01 . . . . . . . . . . . . . . . . . . . 1001 9.99 0.391435454 0.466473036 9.60027146 9.714420306 9.905861194

First we construct a regression consistent with least squares in y for both least error in y and also for functional estimation with the correct line parameters for a regression line using a randomized but increasing $Y1$ for increasing $X1$ values, i.e., $X1=\{0,0.01,0.02,0.03,,,9.97,9.98,9.99\}$ from the line $y=X$, where $Y_i$ are randomized $y_i$-values ($\{X1,Y1\}$ in code). We do $n=1000$ times NORM.INV(RAND1, mean=$X_i$, SD=$\sqrt{2}$). From this, as the generating model is $y=X1$, which returns our generating line to within the expected confidence intervals. For our second model, keeping $y=x$, let us vary both $X_i$ and $Y_i$ ($\{X2,Y2\}$ in code), reduce the standard deviations of $X2$ and $Y2$ to 1 maintain the vector sum standard deviation at $\sqrt{2}$ and refit. That gives us the following regression plots.

This gives us the following regression parameters for the monovariate regression case, wherein all of the variability is in the y-axis variable and the least error estimate line for y given x is also the functional relationship between x and y.

 Term Coefficient 95% CI SE t statistic DF p Intercept -0.09807 -0.28222 to 0.08608 0.093842 -1.05 998 0.2962 Slope 1.017 0.985 to 1.048 0.0163 62.50 998 <0.0001

For the bivariate regression line we obtain,

Term Coefficient 95% CI SE t statistic DF p Intercept 0.2978 0.1313 to 0.4643 0.08486 3.51 998 0.0005 Slope 0.9294 0.9010 to 0.9578 0.01447 64.23 998 <0.0001

From this, we see that the OLS fit does not return a slope of 1, or an intercept of 0, which are the values of the generating function. Thus, the values returned are the least error in y estimators, with reduced slope magnitude of that line compared to the generating function.

Next, let us examine the residual structure to see the effect of mono-variate randomness in y versus bi-variate randomness in x and y.

The first image above has a rectangular normal distribution residual pattern suggesting appropriate regression. The lower image has a parallelogram structure and a skewed non-normal residual pattern, this is what I called latent information suggesting inaccuracy. Numerically, both mean residuals are near zero ($-2.33924*10^{-16}$, $-3.37952*10^{-16}$), but when normal distributions are (BIC) fit to these residuals the first remains accurate with mean $-2.33924*10^{-16}$ and standard deviation $1.4834$, but the second is a shifted, more borderline normal with mean $0.0879176$ and standard deviation $1.38753$.

Q1: How do we quantify the systemic inaccuracy, shown as an example here, in mathematical form when OLS regression in y is applied to provide not a least error in y estimate line for bivariate data, but a functional relationship between x and y? This means that if we are comparing method A with method B, e.g., cardiac ejection fraction method A, with cardiac ejection fraction method B, we seldom care what the least error estimate of a method B value is given a method A value, we might want to see how the methods relate to each other; to find the functional relationship between methods, but often we would not care to have one method predict the results of the other. For example, if the slope is positive, OLS will overestimate the y-values for small $x$, and underestimate the y-values for large $x$ because OLS will split the difference to make a least y-error, i.e., in general the slope will be flattened (reduced in magnitude),

@Tim below spent a long time discussing what is and is not bias, that there is or is not a problem, that OLS is wrong or not (it is the wrong tool for bivariate data), etc. His efforts are appreciated, however, that material is extraneous to the original intent of the question and has been deleted.

I would suggest you pick two examples "in the wild"* that 1) are in reference to a definite concrete problem, and 2) you feel represent different uses of the word "bias". That way the answers can be focused and we can all avoid talking past each other. (*Questions on this site would be good, e.g. there are many candidates that show up under the "Related" sidebar of this question.) — GeoMatt22
– GeoMatt22, Commented Dec 15, 2016 at 5:46
I object to the OPs claim that statistical bias has nothing to do with accuracy. MSE is a statistical measure of accuracy and MSE= bias^2+variance. — Michael R. Chernick
– Michael R. Chernick, Commented Dec 15, 2016 at 23:42
I'm sorry that you aren't having fun. I can't make heads or tails of this. I have no idea what you're talking about, from the get go. Starting at the top, what would it mean for bias to be consistent or inconsistent? — gung - Reinstate Monica
– gung - Reinstate Monica, Commented Dec 16, 2016 at 1:01
In statistics, the default notion of "bias" is that the betas systematically differ from their true values, ie, $E[\hat\beta_j]\ne\beta_j$, not about the y-values themselves (although that isn't wrong), & this seems suggested by your phrasing "bias... for OLS linear regression parameters", so that may be part of the confusion. (Note that Tim's answer is about bias in parameters as well.) I still don't know what you mean by "bias is inconsistent", though. Are you just asking if OLS is consistent? Among other things, I also don't understand your example, reproducible code might help. — gung - Reinstate Monica
– gung - Reinstate Monica, Commented Dec 17, 2016 at 0:34
Sometimes there are situations where everyone does something wrong without being aware of it and one individual independently discovers that fact. Scientific revolutions begin with such observations. But it's never a good idea to assume you are that person who knows the truth and is correct: it's always better to assume you don't understand something and to seek a better understanding. This question comes across as being in the former spirit, whereas to be constructive and garner replies that fit in the SE framework, it needs to be recast in the latter. — whuber
– whuber ♦, Commented Dec 19, 2016 at 22:51

Community · Accepted Answer · 2017-05-23 12:39:27Z

Initially, before the massive edits, your question was asking about the definition of bias. Quoting my other answer

Let $X_1,\dots,X_n$ be your sample of independent and identically distributed random variables from distribution $F$. You are interested in estimating unknown but fixed quantity $\theta$, using estimator $g$ being a function of $X_1,\dots,X_n$. Since $g$ is a function of random variables, estimate

$$ \hat\theta_n = g(X_1,\dots,X_n)$$

is also a random variable. We define bias as

$$ \mathrm{bias}(\hat\theta_n) = \mathbb{E}_\theta(\hat\theta_n) - \theta $$

estimator is unbiased when $\mathbb{E}_\theta(\hat\theta_n) = \theta$.

This is the definition of bias in statistics (it is the one mentioned in bias-variance tradeoff). As you and others noted, people use the term "bias" for many different things, for example, we have sampling bias and bias nodes in neural networks (or described in here) in the area of machine learning, while outside statistics there are cognitive biases, you mentioned bias in electrical engineering etc. However if you are looking for some deeper philosophical connection between those concepts, then I'm afraid that you are looking too far.

Regarding "bias" shown on your examples

TLDR; Models you compare may not illustrate what you wanted to show and may be misleading. They illustrate the omitted-variable bias, rather then some kind of OLS bias in general.

Your first example is a handbook example of linear regression model

$$ y_i \sim \mathcal{N}(\alpha + \beta x_i, \;\sigma) $$

where $Y$ is a random variable and $X$ is fixed. In your second example you use

$$ x_i \sim \mathcal{N}(z_i, \;\sigma) \\ y_i \sim \mathcal{N}(z_i, \;\sigma) $$

so both $X$ and $Y$ are both random variables that are conditionally independent given $Z$. You want to model relationship between $Y$ and $X$. You seem to expect to see slope equal to unity as if $Y$ depended on $X$ what is not true by design of your example. To convince yourself, take a closer look at your model. Below I simulate similar data as yours, with the difference that $Z$ is uniformly distributed since for me it seems more realistic then using deterministic variable (it also will make things easier later on), so the model becomes

$$ z_i \sim \mathcal{U}(0, 10) \\ x_i \sim \mathcal{N}(z_i, \;\sigma) \\ y_i \sim \mathcal{N}(z_i, \;\sigma) $$

On the plot below you can see simulated data. On the first plot we see values of $X$ vs $Z$; on the second one $Y$ vs $Z$; on third $X$ vs $Y$ with fitted regression line; and on the final plot values of $X$ vs residuals from the described regression model (similar pattern to yours). Dependence of $X$ and $Y$ to $Z$ is obvious, the dependence of $X$ to $Y$ is illusory given the variable $Z$ that they both depend on. We call this an omitted-variable bias.

This will be even more clear if we look at the regression results:

Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -3.7371 -0.9900 0.0036 0.9293 4.1523 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.5842 0.1199 4.872 1.49e-06 *** x 0.8827 0.0206 42.856 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.393 on 498 degrees of freedom Multiple R-squared: 0.7867, Adjusted R-squared: 0.7863 F-statistic: 1837 on 1 and 498 DF, p-value: < 2.2e-16

and compare them to results of model that includes $Z$:

Call: lm(formula = y ~ x + z) Residuals: Min 1Q Median 3Q Max -2.5871 -0.7032 -0.0118 0.6028 3.1817 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.03394 0.09146 0.371 0.711 x -0.01049 0.04532 -0.232 0.817 z 1.00824 0.04825 20.895 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.018 on 497 degrees of freedom Multiple R-squared: 0.8864, Adjusted R-squared: 0.886 F-statistic: 1940 on 2 and 497 DF, p-value: < 2.2e-16

In the first case we see strong and significant slope for $X$ and $R^2 = 0.79$ (nice!). Notice however what happens if we add $Z$ to our model: slope for $X$ diminishes almost to zero and becomes insignificant, while slope for $Z$ is large and significant, $R^2$ increases to $0.89$. This shows us that it was $Z$ that "caused" the relationship between $X$ and $Y$ since controlling it "takes out" all the $X$'s influence.

Moreover, notice that, intentionally or not, you have chosen such parameters for $Z$ that make it's influence harder to notice at first sight. If you used, for example, $\mathcal{U}(0,1)$, then the residual pattern would be much more striking.

Basically, similar things will happen no matter what $Z$ is, since the effect is caused by the fact that both $X$ and $Y$ depend on $Z$. Below you can see plots from similar model, where $Z$ is normally distributed $\mathcal{N}(0,1)$. The $R^2$ increase for this model is from $0.26$ to $0.52$ when controlling for $Z$.

In each case $Y$ depended on $Z$ and it's relationship with $X$ was illusory and caused by the fact that they both depend on $Z$. This is an important problem in statistics, but it is not caused by any pitfalls of OLS regression, or our inability to measure bias, but by using a misspecified model that does not consider some important variable.

Coca-cola adverts do not cause snow to fall and do not make people give each other presents, those things just happen together on Christmas. It would be wrong to model snowfall predicted by the screenings of Coca-cola adverts while ignoring the fact that they both happen on December.

Sidenote: I guess that what you might have been thinking of is a random design regression (or random regression; e.g. Hsu et al, 2011, An analysis of random design linear regression) but I do not think that the example you provided is relevant for discussing it.

@Carl ask yourself: will the true conditional mean differ from the estimated conditional mean? Regression is BLUE unless it's assumptions are violated. — Tim
– Tim, Commented Dec 15, 2016 at 8:05
@Carl "how to call it" it's different question, but nonetheless I'd have two problems with answering it: (1) it is not totally clear for me what is your example, and (2) it is not totally clear for me what is the problem with it -- residuals are centered at zero and quite normally distributed, I can't see the skewness you mention on the histogram etc. — Tim
– Tim, Commented Dec 15, 2016 at 9:31
@Carl it is impossible to attach Excel file in here. Also the file would not be usable for all users so you should rather provide a better and more detailed description of your problem with the simulation procedure described in detail. Moreover, I'd suggest posting it as a new question rather then changing this one or continuing discussion in comments. — Tim
– Tim, Commented Dec 15, 2016 at 12:39
@Carl well, yes, you are right -- because you purposefully, as I understand, produced the data that is inconsistent with regression model and will lead to biased results... But it has nothing to do with how regression is used on daily basics. — Tim
– Tim, Commented Dec 19, 2016 at 13:53
@Carl I am not in your field, but 1) I would guess that there are examples where a sequence of several published papers shows weaknesses in earlier analysis (perhaps even a series of papers by the same group; and with the latest paper >10 years old). 2) I believe brief excerpts (like this) would fall under "fair use" (though I am not a lawyer). 3) There is no need to be nasty when discussing any of these things! — GeoMatt22
– GeoMatt22, Commented Dec 19, 2016 at 18:27

Carl · Accepted Answer · 2020-11-09 17:36:06Z

The answer to this is well known. It is often called regression dilution and has been nicely presented elsewhere on this site. The concept of bias in this context is not as ridiculous as is made out to be here, for example, Thompson by Longford (2001) refers the reader to other methods, expanding the regression model to acknowledge the variability in the $x$ variable, so that no bias arises$^1$.

Edit: There is too much discussion about what is and is not bias. This problem is also called "omitted variable bias," And sure, there are circumstances in which least error in y is appropriate, and others in which it is inappropriate. Now obviously when the goal of using regression does not match what the regression does one can talk about bias, and when a regression method is used properly, one does not talk about bias. In general, when both x and y are distributed (meaning that the intervals between sorted x-values is not constant), least squares in y gives a least error in y answer, but does not return the generating function and does not represent the relationship between x and y. It is a bit contorted to then claim that a least error in y estimator is unbiased because the bias is that of least error in y. Now, inverting this notion, that least squares in y is an unbiased estimator of a least error estimator of y-values, is formally correct, but assumes that we actually wanted to obtain a least error in y result, which is not an invertible result. For example, least squares in x for distributed data on both axes is not the same as least squares in y, although the correlations are the same, the regression lines are not.

A example circumstance in which this becomes very important is when we wish to replace one assay with another. In that case, we are not using x to predict y, we are using x to replace y, so we would use Passing-Bablok regression or Deming Regression to do so, and not OLS.

Longford, N. T. (2001). "Correspondence". Journal of the Royal Statistical Society, Series A. 164: 565. doi:10.1111/1467-985x.00219

Stack Exchange Network

When are OLS linear regression parameters inaccurate?

2 Answers 2

Regarding "bias" shown on your examples

Linked

Hot Network Questions

When are OLS linear regression parameters inaccurate?

2 Answers 2

Regarding "bias" shown on your examples

Linked

Related

Hot Network Questions