I have a response variable, y.hat, that is an estimate of animal abundance, and I know the standard error of y.hat. I'm skeptical of a recommendation to use the uncertainty in y.hat as a weight when I regress or calibrate y.hat against another variable. There are a few parts to consider. First, the standard error of y.hat tends to increase with y.hat, so large abundance estimates receive less weight than small ones, which would seem to bias the fit low. Second, the independent variable is positively correlated with y.hat, so there is more uncertainty on the right-hand side of the plot. That is heteroscedasticity, which is exactly the situation where I'd think WLS is appropriate. So I think we have a potential trade-off between introducing bias (because the weights covary with y) and accommodating heteroscedasticity.
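To make the heteroscedasticity half of that trade-off concrete, here is a minimal R sketch (made-up data; names like `fit.wls` are mine, not from my actual analysis). When the error SD grows with x (the known design variable) rather than with the realized y, weighting by the inverse error variance is the textbook WLS remedy, and both OLS and WLS stay roughly unbiased for the slope:

```r
set.seed(1)
n <- 100
x <- runif(n, 10, 30)
sd.i <- 0.3 * x                            # error SD grows with x: heteroscedastic
y <- 2 * x + rnorm(n, 0, sd.i)             # true slope is 2
fit.ols <- lm(y ~ x)                       # ignores heteroscedasticity
fit.wls <- lm(y ~ x, weights = 1 / sd.i^2) # classic WLS
coef(fit.ols)                              # slope near 2
coef(fit.wls)                              # slope near 2, smaller SE
```

The distinction I'm trying to draw: here the weights are a function of x, not of the noisy realized y.hat, so they do not covary with the errors.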
Even if the uncertainty were randomly assigned to each data pair, I still don't see why we'd want to use it as weights. Here's a little R code that simulates data where the uncertainty is random (the default) vs. a function of y.hat (the commented-out line). Rather than regressing, this code calibrates x to y.hat using a mean of ratios. The result is that using weights gives a biased estimate of the true ratio (2) when the uncertainty is correlated with y.hat, and an unbiased but relatively imprecise estimate when it is not.
Am I right that using uncertainty in the estimate of y as a weight is inappropriate in this context?
N <- 6
reps <- 5000
out1 <- matrix(NA, reps, 2)
for (i in 1:reps) {
  x <- runif(N, 10, 30)
  y.hat <- rnorm(N, 2 * x, 10)   # true ratio of y.hat to x is 2
  # se <- -0.1 + 0.3 * y.hat     # uncertainty a function of y.hat
  se <- rnorm(N, 7, 4)           # uncertainty random (default)
  w <- 1 / se^2
  out1[i, 1] <- mean(y.hat / x)              # unweighted mean of ratios
  out1[i, 2] <- sum(y.hat / x * w) / sum(w)  # weighted mean of ratios
}
hist(out1[, 1], 50)
hist(out1[, 2], 50)
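For comparison, here is the same simulation with the commented-out SE line made active (everything else unchanged), which produces the biased-low weighted estimate I described: small y.hat values get large weights, pulling the weighted mean of ratios below the true value of 2 while the unweighted mean stays near 2.

```r
set.seed(1)
N <- 6
reps <- 5000
out2 <- matrix(NA, reps, 2)
for (i in 1:reps) {
  x <- runif(N, 10, 30)
  y.hat <- rnorm(N, 2 * x, 10)
  se <- -0.1 + 0.3 * y.hat       # SE now increases with y.hat
  w <- 1 / se^2                  # so small y.hat gets large weight
  out2[i, 1] <- mean(y.hat / x)              # unweighted mean of ratios
  out2[i, 2] <- sum(y.hat / x * w) / sum(w)  # weighted mean of ratios
}
colMeans(out2)   # unweighted column near 2; weighted column below 2
```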