$\begingroup$

I have a list of mean values $\mu_i$, each computed from a different number of values $n_i$, and I have calculated each mean's corresponding standard deviation $\sigma_i$.

I want to calculate the mean value of the means $\{\mu_1, \ldots, \mu_k\}$:

$\hat{\mu}=\frac{1}{k}\sum_{i=1}^k\mu_i$, and calculate a standard deviation for it. What is the correct formula for the standard deviation of this mean of means that takes into account that each mean was calculated from a different number of values $n_i$?

It is not clear to me if I should use population or sample standard deviation for the $\sigma_i$s.

Note: this is not survey data but rather numerical data and I am trying to calculate the spread in the data of multiple identical measurements.

$\endgroup$
  • $\begingroup$ This question has a correct answer here. Look up the accepted answer by heropup. It gives the answer for just 2 samples, but you can easily generalize to $k$. $\endgroup$ Commented Apr 17 at 18:50
  • $\begingroup$ Thanks, that's by far the best answer I've seen. $\endgroup$ Commented Apr 18 at 0:32
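
The generalization to $k$ groups mentioned in the comment above can be sketched as follows. This assumes the linked answer uses the standard pooling identity: the total sum of squares splits into a within-group part, $\sum_i (n_i - 1)\sigma_i^2$, and a between-group part, $\sum_i n_i(\mu_i - \bar{x})^2$. It also assumes each $\sigma_i$ is a sample standard deviation (divisor $n_i - 1$). The function name `pool_groups` and the simulated data are hypothetical, chosen only for illustration.

```python
import numpy as np

def pool_groups(means, sds, ns):
    """Combine per-group means and sample SDs (ddof=1) into the mean and
    sample SD of the pooled raw data, without needing the raw data."""
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    ns = np.asarray(ns, dtype=float)

    total = ns.sum()
    grand_mean = np.sum(ns * means) / total

    # Sum of squares within each group, recovered from the sample variances.
    within = np.sum((ns - 1) * sds**2)
    # Sum of squares of the group means around the grand mean.
    between = np.sum(ns * (means - grand_mean) ** 2)

    pooled_sd = np.sqrt((within + between) / (total - 1))
    return grand_mean, pooled_sd

# Hypothetical repeated measurements with different group sizes.
rng = np.random.default_rng(0)
groups = [rng.normal(10.0, 2.0, size=n) for n in (5, 12, 30)]
means = [g.mean() for g in groups]
sds = [g.std(ddof=1) for g in groups]
ns = [g.size for g in groups]

grand_mean, pooled_sd = pool_groups(means, sds, ns)
all_values = np.concatenate(groups)
```

If the $\sigma_i$ were population standard deviations instead (divisor $n_i$), the within-group term would become $\sum_i n_i\sigma_i^2$ and the final divisor would be the total count rather than the total count minus one.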

1 Answer

$\begingroup$

Your notation should clarify that $k$ rather than $n$ denotes the number of clusters so as not to conflict with the vector of sample sizes $n_i, i = 1, \ldots, k$.

It depends on your design. Even if you take a simple random sample within each cluster $i$, you need to know the total cluster size $N_i$ (with $n_i \le N_i$) to combine the clusters meaningfully. Suppose each $i$ is a US state and you sample 100 people from each; California is far more populous than Rhode Island. The naive "mean of means" could be justified if each cluster is a simple random sample (SRS), so that each person everywhere is equally likely to respond.

More often, subjects within a cluster are dependent to some degree, and participation probabilities differ between clusters. As a start, you might use a frequency weight, $\hat{\mu} = \sum_{i=1}^k \mu_i n_i / \sum_{i=1}^k n_i$, or a precision weight, $\hat{\mu} = \sum_{i=1}^k \mu_i \sigma^{-2}_i / \sum_{i=1}^k \sigma^{-2}_i$. You can also standardize the responses to the known population sizes, $\hat{\mu} = \sum_{i=1}^k \mu_i N_i / \sum_{i=1}^k N_i$; note that this gets complicated when $N_i$ is itself uncertain!
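
A minimal sketch of the three weighting schemes, assuming NumPy is available. All numbers are made up for illustration, and `weighted_mean` is a hypothetical helper name. The precision weight takes $\sigma_i$ exactly as it appears in the formula; whether that should be the spread of the raw values or the standard error of each mean depends on your design.

```python
import numpy as np

def weighted_mean(means, weights):
    """Return sum(w_i * mu_i) / sum(w_i)."""
    return np.average(np.asarray(means, dtype=float),
                      weights=np.asarray(weights, dtype=float))

# Hypothetical inputs for k = 3 clusters.
mu = np.array([10.0, 12.0, 11.0])     # cluster means
n = np.array([5, 15, 30])             # sample sizes n_i
sigma = np.array([2.0, 1.0, 0.5])     # standard deviations sigma_i
N = np.array([1000, 3000, 6000])      # total cluster sizes N_i

naive = mu.mean()                               # unweighted mean of means
freq_weighted = weighted_mean(mu, n)            # weights n_i
prec_weighted = weighted_mean(mu, sigma**-2.0)  # weights 1 / sigma_i^2
size_weighted = weighted_mean(mu, N)            # weights N_i

print(naive, freq_weighted, prec_weighted, size_weighted)
```

With these inputs the naive mean is 11.0, while the frequency-weighted and size-standardized means are both 11.2, because the larger clusters have the higher means.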

There are far more sophisticated survey methods, especially combinatorial ones that handle the case where $n_i$ is not much smaller than $N_i$ ($n_i \not\ll N_i$). The book Complex Surveys: A Guide to Analysis Using R by Lumley is a companion to the R survey package and provides a valuable and comprehensive look at how to handle different estimation methods.

$\endgroup$
  • $\begingroup$ Hi @AdamO, thanks for the answer. I think what's important to mention is that it's numerical data and I know all the $n_i$'s. Would that simplify matters? Also I'm not familiar with terms like cluster. $\endgroup$ Commented Apr 17 at 17:03
  • $\begingroup$ @datalung I am using $n_i$ and $N_i$ distinctly. If you sample $n_i = 100$ from your state, say Oregon, then the $N_i$ is the total population of Oregon. $\endgroup$ Commented Apr 17 at 18:28
