
Suppose that $w_1,w_2,\ldots,w_n$ and $x_1,x_2,\ldots,x_n$ are each drawn i.i.d. from some distributions, with $w_i$ independent of $x_i$. The $w_i$ are strictly positive. You observe all the $w_i$, but not the $x_i$; rather, you observe $\sum_i x_i w_i$. I am interested in estimating $\operatorname{E}\left[x\right]$ from this information. Clearly the estimator $$ \bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i} $$ is unbiased, and can be computed from the information at hand.

How might I compute the standard error of this estimator? For the sub-case where $x_i$ takes only the values 0 and 1, I naively tried $$ \mathrm{se} \approx \frac{\sqrt{\bar{x}(1-\bar{x})\sum_i w_i^2}}{\sum_i w_i}, $$ basically ignoring the variability in the $w_i$, but found that this performed poorly for sample sizes smaller than around 250. (And this probably depends on the variance of the $w_i$.) It seems that maybe I don't have enough information to compute a 'better' standard error.
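For reference, the naive estimator above can be computed from the observables alone: the weights and the single reported total $\sum_i w_i x_i$. A minimal sketch in R (the names `total` and `naive_se` are mine, purely illustrative):

```r
# Naive SE for the Bernoulli sub-case, using only the observed weights w
# and the reported total = sum(w * x); it ignores variability in the w_i.
naive_se <- function(total, w) {
  xbar <- total / sum(w)                       # unbiased estimate of E[x]
  sqrt(xbar * (1 - xbar) * sum(w^2)) / sum(w)
}

set.seed(1)
w <- rexp(100)               # strictly positive weights, observed
x <- rbinom(100, 1, 0.3)     # Bernoulli draws, unobserved in practice
naive_se(sum(w * x), w)
```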


3 Answers


I ran into the same issue recently. The following is what I found:

Unlike a simple random sample with equal weights, there is no widely accepted definition of the standard error of a weighted mean. These days it would be straightforward to bootstrap: obtain the empirical distribution of the weighted mean, and estimate the standard error from that.
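For example, when the individual $(x_i, w_i)$ pairs are available (which this answer assumes, though the original question does not observe the $x_i$), a pairs bootstrap is only a few lines of R — a sketch, with `B` the number of resamples:

```r
# Bootstrap SE of the weighted mean: resample (x, w) pairs jointly,
# recompute the weighted mean each time, take the SD across resamples.
boot_se_wmean <- function(x, w, B = 2000) {
  n <- length(x)
  stats <- replicate(B, {
    i <- sample.int(n, replace = TRUE)
    sum(w[i] * x[i]) / sum(w[i])
  })
  sd(stats)
}

set.seed(123)
boot_se_wmean(rnorm(50), rexp(50))
```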

What if one wanted to use a formula to do this estimation?

The main reference is a paper by Donald F. Gatz and Luther Smith, in which three formula-based estimators are compared against bootstrap results. The best approximation to the bootstrap comes from Cochran (1977):

$$(\mathrm{SEM}_w)^2=\frac{n}{(n-1)\left(\sum P_i\right)^2}\left[\sum \left(P_i X_i-\bar{P}\bar{X}_w\right)^2-2\,\bar{X}_w \sum (P_i-\bar{P})(P_i X_i-\bar{P}\bar{X}_w)+\bar{X}_w^2 \sum (P_i-\bar{P})^2\right]$$

The following is the corresponding R code that came from this R listserve thread.

```r
weighted.var.se <- function(x, w, na.rm = FALSE) {
  # Computes the variance of a weighted mean following the Cochran (1977) definition
  if (na.rm) {
    w <- w[i <- !is.na(x)]
    x <- x[i]
  }
  n <- length(w)
  xWbar <- weighted.mean(x, w, na.rm = na.rm)
  wbar <- mean(w)
  n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2) -
       2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
       xWbar^2 * sum((w - wbar)^2))
}
```
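As a quick sanity check (my addition, not from the thread): with equal weights the second and third terms vanish, since $P_i - \bar{P} = 0$, and the expression reduces to the familiar $s^2/n$. Inlining the formula:

```r
# Equal weights: Cochran's formula should reduce to var(x) / n.
set.seed(42)
x <- rnorm(50)
w <- rep(1, 50)
n <- length(w)
xWbar <- weighted.mean(x, w)
wbar <- mean(w)
cochran <- n / ((n - 1) * sum(w)^2) *
  (sum((w * x - wbar * xWbar)^2) -
     2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
     xWbar^2 * sum((w - wbar)^2))
all.equal(cochran, var(x) / n)   # TRUE
```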

Hope this helps!

  • This is pretty cool, but for my problem I don't even observe the $P_iX_i$; rather, I observe the sum $\sum_i P_iX_i$. My question is very weird because it involves some information asymmetry (a third party is reporting the sum, and trying perhaps to hide some information). Commented Aug 10, 2012 at 0:20
  • Gosh, you're right; sorry I did not fully understand the question you posed. Suppose we boil your problem down to the simplest case where all $w_i$ are Bernoulli RVs. Then you are essentially observing the sum of a random subset of $n$ RVs. My guess is there is not a lot of information here to estimate with. So what did you end up doing for your original problem? Commented Aug 10, 2012 at 15:00
  • @Ming-ChihKao This Cochran formula is interesting, but if you build a confidence interval off it when the data are not normal, there is no consistent interpretation, correct? How would you handle non-normal weighted-mean confidence intervals? Weighted quantiles? Commented Jul 16, 2016 at 1:55
  • I think there is an error in the function. If you substitute w=rep(1, length(x)), then weighted.var.se(rnorm(50), rep(1, 50)) is about 0.014. I think the formula is missing a sum(w^2) in the numerator, since when P=1, the variance is 1/(n*(n-1)) * sum((x-xbar)^2). I can't check the cited article as it is behind a paywall, but I think that correction is needed. Oddly enough, Wikipedia's (different) solution becomes degenerate when all weights are equal: en.wikipedia.org/wiki/…. Commented Jun 24, 2018 at 5:08
  • These may work better in general: analyticalgroup.com/download/WEIGHTED_MEAN.pdf Commented Jun 24, 2018 at 5:08

The variance of your estimate given the $w_i$ is $$ \frac{\sum w_i^2 \operatorname{Var}(X)}{(\sum w_i)^2} = \operatorname{Var}(X)\, \frac{\sum w_i^2 }{(\sum w_i)^2}. $$ Because your estimate is unbiased for any $w_i$, the variance of its conditional mean is zero. Hence, by the law of total variance, the variance of your estimate is $$ \operatorname{Var}(X)\, \mathbb{E}\left(\frac{\sum w_i^2 }{(\sum w_i)^2}\right). $$ With all the data observed, this would be easy to estimate empirically. But with only a measure of location of the $X_i$ observed, and not their spread, I don't see how it's going to be possible to estimate $\operatorname{Var}(X)$ without making rather severe assumptions.
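This decomposition is straightforward to confirm by simulation in the Bernoulli sub-case from the question — a sketch, where the exponential weights and $p = 0.3$ are my arbitrary choices:

```r
# Monte Carlo check: Var(xbar) should equal Var(X) * E[sum(w^2)/sum(w)^2].
set.seed(7)
n <- 30; reps <- 20000
est <- numeric(reps); ratio <- numeric(reps)
for (r in seq_len(reps)) {
  w <- rexp(n)               # strictly positive weights
  x <- rbinom(n, 1, 0.3)     # Var(X) = 0.3 * 0.7 = 0.21
  est[r]   <- sum(w * x) / sum(w)
  ratio[r] <- sum(w^2) / sum(w)^2
}
var(est)             # empirical variance of the estimator
0.21 * mean(ratio)   # Var(X) * E[sum w_i^2 / (sum w_i)^2]
```

The two printed values agree up to Monte Carlo error.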

  • At least in the specific case where the $x_i$ have a Bernoulli distribution, I can estimate the variance of $x$ by $\bar{x}(1-\bar{x})$, as noted above. Even in this case, as noted in the question, I need a larger sample size than I would have expected. Commented Apr 5, 2012 at 17:42

@Ming K's equation is not working for me. @Hugh mentioned Hmisc::wtd.var(x, w), but that gives the weighted variance; if what you want is the weighted standard error, the following may be useful. But read the assumptions and the equation here: $$ \sigma_{\bar{x}} = \sigma \sqrt{\sum_{i=1}^{n}\omega_i'^{\,2}}, $$ where the $\omega_i' = w_i/\sum_j w_j$ are the normalized weights.

For your convenience, I copy the code here.

```r
wtd.stderror <- function(x, weights) {
  var <- Hmisc::wtd.var(x, weights)
  weights <- sum((weights / sum(weights))^2)
  sqrt(var * weights)
}
```

But I am not sure whether this will work for a dataset with a Bernoulli distribution.

  • Please re-read the question: the $x_i$ are not observed or are otherwise unavailable for this computation. Commented Oct 15, 2021 at 19:37
  • Sorry, I see — this is to supplement Ming K's and Hugh's answers. Commented Oct 17, 2021 at 11:35
