What does pooled variance “actually” mean?

I am a noob in statistics, so could you guys please help me out here.

My question is the following: What does pooled variance actually mean?

When I look for a formula for pooled variance on the internet, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathematical_Statistics/Introduction_to_Statistical_Tests/thispage/newnode19.html):

\begin{equation} \label{eq:stupidpooledvar}
\displaystyle S^2_p = \frac{S_1^2 (n_1-1) + S_2^2 (n_2-1)}{n_1 + n_2 - 2}
\end{equation}

But what does it actually calculate? Because when I use this formula to calculate my pooled variance, it gives me the wrong answer.

For example, consider this “parent sample”:

\begin{equation} \label{eq:parentsample}
2,2,2,2,2,8,8,8,8,8
\end{equation}

The variance of this parent sample is $S^2_p=10$, and its mean is $\bar{x}_p=5$.

Now, suppose I split this parent sample into two sub-samples:

  1. The first sub-sample is 2,2,2,2,2 with mean $\bar{x}_1=2$ and variance $S^2_1=0$.
  2. The second sub-sample is 8,8,8,8,8 with mean $\bar{x}_2=8$ and variance $S^2_2=0$.

Now, clearly, using the above formula to calculate the pooled/parent variance of these two sub-samples will produce zero, because $S_1=0$ and $S_2=0$. So what does this formula actually calculate?

On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:

\begin{equation} \label{eq:smartpooledvar}
\displaystyle S^2_p = \frac{S_1^2 (n_1-1) + n_1 d_1^2 + S_2^2 (n_2-1) + n_2 d_2^2} {n_1 + n_2 - 1}
\end{equation}

In the above formula, $d_1=\bar{x}_1-\bar{x}_p$ and $d_2=\bar{x}_2-\bar{x}_p$.
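
As a quick check of the formula above, here is a minimal Python sketch that computes the parent variance from the per-group sizes, means, and variances only, and compares it with the variance of the concatenated data. The function name `combined_variance` is just a label chosen for this illustration, and the sketch assumes the sample (n-1) variance throughout.

```python
from statistics import mean, variance


def combined_variance(a, b):
    """Sample variance of a and b concatenated, built from group summaries."""
    n1, n2 = len(a), len(b)
    # Mean of the parent sample, as a size-weighted average of the group means
    parent_mean = (n1 * mean(a) + n2 * mean(b)) / (n1 + n2)
    d1 = mean(a) - parent_mean
    d2 = mean(b) - parent_mean
    numerator = (
        (n1 - 1) * variance(a) + n1 * d1**2
        + (n2 - 1) * variance(b) + n2 * d2**2
    )
    return numerator / (n1 + n2 - 1)


first = [2, 2, 2, 2, 2]
second = [8, 8, 8, 8, 8]

print(combined_variance(first, second))  # from group summaries
print(variance(first + second))          # directly from the parent sample
```

Both lines should print a variance of 10 for this example, matching the value stated for the parent sample.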

I found a formula similar to mine, for example here: http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html
and also in Wikipedia. Although I have to admit that they don’t look exactly the same as mine.

So again, what does pooled variance actually mean? Shouldn’t it mean the variance of the parent sample formed from the two sub-samples? Or am I completely wrong here?

Thank you in advance.


EDIT 1: Someone says that my two sub-samples above are pathological since they have zero variance. Well, I could give you a different example. Consider this parent sample:

\begin{equation} \label{eq:parentsample2}
1,2,3,4,5,46,47,48,49,50
\end{equation}

The variance of this parent sample is $S^2_p=564.7$, and its mean is $\bar{x}_p=25.5$.

Now, suppose I split this parent sample into two sub-samples:

  1. The first sub-sample is 1,2,3,4,5 with mean $\bar{x}_1=3$ and variance $S^2_1=2.5$.
  2. The second sub-sample is 46,47,48,49,50 with mean $\bar{x}_2=48$ and variance $S^2_2=2.5$.

Now, if you use the “literature’s formula” to compute the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. Instead, if you use “my formula”, you will get the correct answer.
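
The numbers in this second example can be reproduced with a short Python sketch. The variable names (`low`, `high`, `pooled`, `parent`) are only labels for this illustration; `statistics.variance` computes the sample (n-1) variance.

```python
from statistics import variance

low = [1, 2, 3, 4, 5]
high = [46, 47, 48, 49, 50]
n1, n2 = len(low), len(high)

# "Literature" pooled variance: size-weighted average of the two
# within-group variances; the group means never enter the formula.
pooled = ((n1 - 1) * variance(low) + (n2 - 1) * variance(high)) / (n1 + n2 - 2)

# Variance of the parent sample obtained by concatenating both groups.
parent = variance(low + high)

print(pooled)            # 2.5
print(round(parent, 1))  # 564.7
```

The two quantities differ because only the second one sees the distance between the group means.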

Please understand, I use extreme examples here to show people that the formula is indeed wrong. If I use “normal data” which doesn’t have a lot of variation (no extreme cases), then the results from those two formulae will be very similar, and people could dismiss the difference as rounding error rather than as a problem with the formula itself.

Answer

Put simply, the pooled variance is an (unbiased) estimate of the variance within each sample, under the assumption/constraint that those variances are equal.

This is explained, motivated, and analyzed in some detail in the Wikipedia entry for pooled variance.

It does not estimate the variance of a new “meta-sample” formed by concatenating the two individual samples, as you supposed. As you have already discovered, estimating that requires a completely different formula.
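
One way to see how the two quantities relate is the usual within-group/between-group split of the sum of squares. The subscripts "pooled" and "parent" are labels introduced here to keep the two variances apart (the question writes $S^2_p$ for both), and $d_1$, $d_2$ are the deviations of the group means from the parent mean, as defined in the question:

\begin{equation}
\displaystyle (n_1 + n_2 - 1)\, S^2_{\text{parent}} = (n_1 + n_2 - 2)\, S^2_{\text{pooled}} + n_1 d_1^2 + n_2 d_2^2
\end{equation}

For the second example in the question this gives $9 \times 564.72 \approx 5082.5$ on the left and $8 \times 2.5 + 5 \times 22.5^2 + 5 \times 22.5^2 = 5082.5$ on the right. The pooled variance keeps only the within-group term, so the two values coincide only when the group means are nearly equal.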

Attribution
Source : Link , Question Author : Hanciong , Answer Author : Jake Westfall
