19
$\begingroup$

I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element of the array $x_i$ is replaced by another element $x_j$, then new mean will be

$\mu_{new}=\mu_{old}+\frac{x_j-x_i}{n}$

Advantage of this approach is it requires constant computation regardless of value of $n$. Is there any approach to calculate $\sigma_{new}$ using $\sigma_{old}$ like the computation of $\mu_{new}$ using $\mu_{old}$?

$\endgroup$
5
  • $\begingroup$ Is this homework? A very similar task was asked in our course of mathematical statistics... $\endgroup$ Commented Mar 19, 2012 at 12:47
  • 2
    $\begingroup$ @user946850: No, it's not homework. I am conducting my thesis on Evolutionary Algorithm. I want to use standard deviation as a measure of population diversity. Just looking for more efficient solution. $\endgroup$ Commented Mar 19, 2012 at 12:50
  • 1
    $\begingroup$ The SD is the square root of the variance, which is just the mean squared value (adjusted by a multiple of the squared mean, which you already know how to update). Therefore, the same methods used to compute a running mean can be applied without any fundamental change to compute a running variance. In fact, much more sophisticated statistics can be computed on an online basis using the same ideas: see the threads at stats.stackexchange.com/questions/6920 and stats.stackexchange.com/questions/23481, for example. $\endgroup$ Commented Mar 19, 2012 at 19:14
  • 1
    $\begingroup$ @whuber: This is mentioned in the Wikipedia article for Variance, but also with a note on catastrophic cancellation (or loss of significance) that may occur. Is this overrated, or a real problem for the running variance? $\endgroup$ Commented Mar 20, 2012 at 9:56
  • 1
    $\begingroup$ That's a great question. If you accumulate the variances naively, without centering them beforehand, you can indeed get into trouble. The problem occurs when the numbers are huge but their variance is small. E.g., consider a series of accurate measurements of the speed of light in m/s, as in 299792458.145, 299792457.883, 299792457.998, ...: their variance, which is around 0.01, is so small compared to their squares, which is around $10^{17}$, that careless calculation (even in double precision) would result in zero variance: all significant digits would vanish. $\endgroup$ Commented Mar 20, 2012 at 15:36

4 Answers 4

8
$\begingroup$

A section in the Wikipedia article on "Algorithms for calculating variance" shows how to compute the variance if elements are added to your observations. (Recall that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array, then

$$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$$

EDIT: Above formula seems to be wrong, see comment.

Now, replacing an element means adding an observation and removing another one; both can be computed with the formula above. However, keep in mind that problems of numerical stability may ensue; the quoted article also proposes numerically stable variants.

To derive the formula by yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of sample variance and substitute $\mu_{new}$ by the formula you gave when appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$. In my notation, I assume you replace the element $x_n$ by $x_n'$:

$$ \begin{eqnarray*} \sigma^2 &=& (n-1)^{-1} \sum_k (x_k - \mu)^2 \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &=& \sum_{k=1}^{n-1} ((x_k - \mu_{new})^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2) \\ &=& \sum_{k=1}^{n-1} ((x_k - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_n - \mu_{old})^2) \\ \end{eqnarray*}\\ $$

The $x_k$ in the sum transform into something dependent of $\mu_{old}$, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.

$\endgroup$
2
  • $\begingroup$ the first formula you gave does not seem correct, well it means that if the $x_{n+1}$ is smaller/larger then from both new and old mean, the variance always increases, which does not make any sense. It may increase or decrease depending on the distribution. $\endgroup$ Commented May 27, 2014 at 6:09
  • $\begingroup$ @EmmetB: Yes, you're right -- this should probably be $\sigma_{new}^2 = \frac{n-1}{n} \sigma_{old}^2 + \frac{1}{n} (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$ Unfortunately, this renders void my whole discussion from there, but I'm leaving it for historic purposes. Feel free to edit, though. $\endgroup$ Commented May 27, 2014 at 14:20
4
$\begingroup$

Based on what i think i'm reading on the linked Wikipedia article you can maintain a "running" standard deviation:

real sum = 0; int count = 0; real S = 0; real variance = 0; real GetRunningStandardDeviation(ref sum, ref count, ref S, x) { real oldMean; if (count >= 1) { real oldMean = sum / count; sum = sum + x; count = count + 1; real newMean = sum / count; S = S + (x-oldMean)*(x-newMean) } else { sum = x; count = 1; S = 0; } //estimated Variance = (S / (k-1) ) //estimated Standard Deviation = sqrt(variance) if (count > 1) return sqrt(S / (count-1) ); else return 0; } 

Although in the article they don't maintain a separate running sum and count, but instead have the single mean. Since in thing i'm doing today i keep a count (for statistical purposes), it is more useful to calculate the means each time.

$\endgroup$
1
1
$\begingroup$

Given original $\bar x$, $s$, and $n$, as well as the change of a given element $x_n$ to $x_n'$, I believe your new standard deviation $s'$ will be the square root of $$s^2 + \frac{1}{n-1}\left(2n\Delta \bar x(x_n-\bar x) +n(n-1)(\Delta \bar x)^2\right),$$ where $\Delta \bar x = \bar x' - \bar x$, with $\bar x'$ denoting the new mean.

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.

$\endgroup$
2
  • 1
    $\begingroup$ @john / whistling in the Dark: I liked your answer, it seems work properly in my small dataset. Is there any mathematical foundation/reference on it? Could you kindly help? $\endgroup$ Commented Dec 1, 2017 at 22:32
  • $\begingroup$ The question was all @Whistling in the Dark, I just cleaned it up for the site. You should pose a new question referencing the question and answer here. And also you should upvote this answer if you feel that way. $\endgroup$ Commented Dec 3, 2017 at 19:45
0
$\begingroup$

If you make the assumption that the preliminary data that you have represents all of the values within the population with the relative frequencies, then increasing the sample size as a multiple of $n$ will be like copying the data set and pasting it below and then recalculating the mean and standard deviation.

The mean will remain the same, but the standard deviation will decrease. Using this model, you can derive a formula that allows you to estimate the new standard deviation based on a new sample size, $n$. That formula is $$S_{2} = S_{1} \times \sqrt{\frac{n_2 - (n_2/n_1)}{n_2 - 1} }.$$

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.