# Showing that 100 measurements for 5 subjects provide much less information than 5 measurements for 100 subjects

At a conference I overheard the following statement:

100 measurements for 5 subjects provide much less information than 5 measurements for 100 subjects.

It’s kind of obvious that this is true, but I was wondering how one could prove it mathematically… I think a linear mixed model could be used. However, I don’t know much about the math used for estimating them (I just run lme4 for LMMs and brms for GLMMs 🙂). Could you show me an example where this is true? I’d prefer an answer with some formulas rather than just some code in R. Feel free to assume a simple setting, for example a linear mixed model with normally distributed random intercepts and slopes.

PS a math-based answer which doesn’t involve LMMs would be ok too. I thought of LMMs because they seemed to me the natural tool to explain why fewer measurements from more subjects are better than more measurements from fewer subjects, but I may well be wrong.

The short answer is that your conjecture is true if and only if there is a positive intra-class correlation in the data. Empirically, most clustered datasets show a positive intra-class correlation, so in practice your conjecture is usually true. But if the intra-class correlation is 0, then the two cases you mentioned are equally informative. And if the intra-class correlation is negative, then it’s actually less informative to take fewer measurements on more subjects; we would prefer (as far as reducing the variance of the parameter estimate is concerned) to take all our measurements on a single subject.

Statistically there are two perspectives from which we can think about this: a random-effects (or mixed) model, which you mention in your question, or a marginal model, which ends up being a bit more informative here.

## Random-effects (mixed) model

Say we have a set of $n$ subjects from whom we’ve taken $m$ measurements each. Then a simple random-effects model of the $j$th measurement from the $i$th subject might be
$$y_{ij} = \beta + u_i + e_{ij},$$
where $\beta$ is the fixed intercept, $u_i$ is the random subject effect (with variance $\sigma^2_u$), $e_{ij}$ is the observation-level error term (with variance $\sigma^2_e$), and the latter two random terms are independent.

In this model $\beta$ represents the population mean, and with a balanced dataset (i.e., an equal number of measurements from each subject), our best estimate is simply the sample mean. So if we take “more information” to mean a smaller variance for this estimate, then basically we want to know how the variance of the sample mean depends on $n$ and $m$. With a bit of algebra we can work out that
$$\begin{aligned} \text{var}\Big(\frac{1}{nm}\sum_i\sum_jy_{ij}\Big) &= \text{var}\Big(\frac{1}{nm}\sum_i\sum_j(\beta + u_i + e_{ij})\Big) \\ &= \frac{1}{n^2m^2}\text{var}\Big(\sum_i\sum_ju_i + \sum_i\sum_je_{ij}\Big) \\ &= \frac{1}{n^2m^2}\Big(m^2\sum_i\text{var}(u_i) + \sum_i\sum_j\text{var}(e_{ij})\Big) \\ &= \frac{1}{n^2m^2}(nm^2\sigma^2_u + nm\sigma^2_e) \\ &= \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm}. \end{aligned}$$
Examining this expression, we can see that whenever there is any subject variance (i.e., $\sigma^2_u>0$), increasing the number of subjects ($n$) will make both of these terms smaller, while increasing the number of measurements per subject ($m$) will only make the second term smaller. (For a practical implication of this for designing multi-site replication projects, see this blog post I wrote a while ago.)
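To put numbers on the statement from the question, here is a minimal sketch that just evaluates the final line of the derivation for the two designs. The variance components ($\sigma^2_u = \sigma^2_e = 1$) are illustrative values I am assuming, and the function name is made up for this example.

```python
def var_sample_mean(n, m, sigma2_u, sigma2_e):
    """Variance of the grand mean under y_ij = beta + u_i + e_ij."""
    return sigma2_u / n + sigma2_e / (n * m)

# Assumed illustrative values (not from any real dataset): equal variance components.
sigma2_u, sigma2_e = 1.0, 1.0

# Same total of 500 observations, allocated two different ways.
few_subjects = var_sample_mean(n=5, m=100, sigma2_u=sigma2_u, sigma2_e=sigma2_e)
many_subjects = var_sample_mean(n=100, m=5, sigma2_u=sigma2_u, sigma2_e=sigma2_e)

print(f"5 subjects x 100 measurements: {few_subjects:.3f}")
print(f"100 subjects x 5 measurements: {many_subjects:.3f}")
print(f"ratio: {few_subjects / many_subjects:.1f}")
```

With these assumed values the variance is 0.202 for 5 subjects and 0.012 for 100 subjects, so the estimate from the 5-subject design is roughly 17 times noisier even though both designs use 500 observations.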

Now you wanted to know what happens when we increase or decrease $m$ or $n$ while holding constant the total number of observations. So for that we consider $nm$ to be a constant, so that the whole variance expression just looks like
$$\frac{\sigma^2_u}{n} + \text{constant},$$
which is as small as possible when $n$ is as large as possible (up to a maximum of $n=nm$, in which case $m=1$, meaning we take a single measurement from each subject).
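If you would rather see this than take the algebra on trust, a small Monte Carlo check is one option. The sketch below simulates the random-effects model many times for two designs with the same $nm = 500$ and compares the empirical variance of the sample mean with $\sigma^2_u/n + \sigma^2_e/(nm)$. Normal random terms, $\beta = 0$, $\sigma_u = \sigma_e = 1$, the seed, and the number of replications are all assumptions made for illustration.

```python
import numpy as np

def simulate_sample_means(n, m, beta, sigma_u, sigma_e, reps, rng):
    """Draw `reps` datasets from y_ij = beta + u_i + e_ij; return their grand means."""
    u = rng.normal(0.0, sigma_u, size=(reps, n, 1))  # one effect per subject
    e = rng.normal(0.0, sigma_e, size=(reps, n, m))  # one error per measurement
    y = beta + u + e                                 # u broadcasts over the m measurements
    return y.mean(axis=(1, 2))

rng = np.random.default_rng(0)
results = {}
for n, m in [(5, 100), (100, 5)]:
    means = simulate_sample_means(n, m, beta=0.0, sigma_u=1.0, sigma_e=1.0,
                                  reps=5_000, rng=rng)
    theory = 1.0 / n + 1.0 / (n * m)  # sigma2_u / n + sigma2_e / (n m)
    results[(n, m)] = (means.var(), theory)
    print(f"n={n:3d}, m={m:3d}: simulated={means.var():.4f}, formula={theory:.4f}")
```

The simulated values will wobble a little from run to run, but they should sit close to the formula for both designs.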

My short answer referred to the intra-class correlation, so where does that fit in? In this simple random-effects model the intra-class correlation is
$$\rho = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e}$$
(sketch of a derivation here). So we can write the variance equation above as
$$\text{var}(\frac{1}{nm}\sum_i\sum_jy_{ij}) = \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm} = \Big(\frac{\rho}{n} + \frac{1-\rho}{nm}\Big)(\sigma^2_u+\sigma^2_e)$$
This doesn’t really add any insight to what we already saw above, but it does make us wonder: since the intra-class correlation is a bona fide correlation coefficient, and correlation coefficients can be negative, what would happen (and what would it mean) if the intra-class correlation were negative?

In the context of the random-effects model, a negative intra-class correlation doesn’t really make sense, because it implies that the subject variance $\sigma^2_u$ is somehow negative (as we can see from the $\rho$ equation above, and as explained here and here)… but variances can’t be negative! But this doesn’t mean that the concept of a negative intra-class correlation doesn’t make sense; it just means that the random-effects model doesn’t have any way to express this concept, which is a failure of the model, not of the concept. To express this concept adequately we need to consider the marginal model.
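To make the problem concrete, here is a tiny sketch that inverts the $\rho$ formula, so that $\sigma^2_u = \rho\,\sigma^2$ and $\sigma^2_e = (1-\rho)\,\sigma^2$ with $\sigma^2 = \sigma^2_u + \sigma^2_e$. The function names and the numbers are assumptions chosen purely for illustration.

```python
def icc(sigma2_u, sigma2_e):
    """Intra-class correlation implied by the two variance components."""
    return sigma2_u / (sigma2_u + sigma2_e)

def components_from_icc(rho, sigma2_total):
    """Invert the ICC formula: return (sigma2_u, sigma2_e) for a given rho and total variance."""
    return rho * sigma2_total, (1 - rho) * sigma2_total

# Assumed total variance of 2.5, with a positive and a negative ICC.
pos = components_from_icc(0.2, 2.5)
neg = components_from_icc(-0.2, 2.5)

print("rho = +0.2 ->", pos)  # both components are positive
print("rho = -0.2 ->", neg)  # the subject "variance" comes out negative
```

The arithmetic goes through either way, but for a negative $\rho$ the first component is a negative number, which cannot be the variance of any random effect $u_i$.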

## Marginal model

For this same dataset we could consider a so-called marginal model of $y_{ij}$,
$$y_{ij} = \beta + e^*_{ij},$$
where basically we’ve pushed the random subject effect $u_i$ from before into the error term $e_{ij}$ so that we have $e^*_{ij} = u_i + e_{ij}$. In the random-effects model we treated $u_i$ and $e_{ij}$ as two separate random terms, each i.i.d. and independent of the other, but in the marginal model we instead consider the $e^*_{ij}$ to have a block-diagonal covariance matrix $\textbf{C}$ (one $m \times m$ block per subject) like
$$\textbf{C}= \sigma^2\begin{bmatrix} \textbf{R} & 0& \cdots & 0\\ 0& \textbf{R} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0& 0& \cdots &\textbf{R}\\ \end{bmatrix}, \textbf{R}= \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots &1\\ \end{bmatrix}$$
In words, this means that under the marginal model we simply consider $\rho$ to be the expected correlation between two $e^*$s from the same subject (we assume the correlation across subjects is 0). When $\rho$ is positive, two observations drawn from the same subject tend to be more similar (closer together), on average, than two observations drawn randomly from the dataset while ignoring the clustering due to subjects. When $\rho$ is negative, two observations drawn from the same subject tend to be less similar (further apart), on average, than two observations drawn completely at random. (More information about this interpretation in the question/answers here.)

So now when we look at the equation for the variance of the sample mean under the marginal model, we have
$$\begin{aligned} \text{var}\Big(\frac{1}{nm}\sum_i\sum_jy_{ij}\Big) &= \text{var}\Big(\frac{1}{nm}\sum_i\sum_j(\beta + e^*_{ij})\Big) \\ &= \frac{1}{n^2m^2}\text{var}\Big(\sum_i\sum_je^*_{ij}\Big) \\ &= \frac{1}{n^2m^2}\Big(n\big(m\sigma^2 + (m^2-m)\rho\sigma^2\big)\Big) \\ &= \frac{\sigma^2\big(1+(m-1)\rho\big)}{nm} \\ &= \Big(\frac{\rho}{n}+\frac{1-\rho}{nm}\Big)\sigma^2, \end{aligned}$$
which is the same variance expression we derived above for the random-effects model, just with $\sigma^2_e+\sigma^2_u=\sigma^2$, which is consistent with our note above that $e^*_{ij} = u_i + e_{ij}$. The advantage of this (statistically equivalent) perspective is that here we can think about a negative intra-class correlation without needing to invoke any weird concepts like a negative subject variance. Negative intra-class correlations just fit naturally in this framework.
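One way to check the closed form is to build $\textbf{C}$ explicitly and compute the variance of the sample mean directly as $\mathbf{1}^\top\textbf{C}\,\mathbf{1}/(nm)^2$, which is just the sum of all entries of $\textbf{C}$ divided by $(nm)^2$. The sketch below does this with NumPy; the helper names and the particular $n$, $m$, $\sigma^2$, and $\rho$ values are assumptions for illustration.

```python
import numpy as np

def marginal_cov(n, m, rho, sigma2):
    """Block-diagonal C: n blocks of sigma2 * R, where R is m x m exchangeable."""
    R = np.full((m, m), rho, dtype=float)
    np.fill_diagonal(R, 1.0)
    return sigma2 * np.kron(np.eye(n), R)

def var_mean_from_cov(C):
    """Variance of the mean of all observations: 1' C 1 / N^2."""
    N = C.shape[0]
    ones = np.ones(N)
    return ones @ C @ ones / N**2

def var_mean_closed_form(n, m, rho, sigma2):
    """Second-to-last line of the derivation."""
    return sigma2 * (1 + (m - 1) * rho) / (n * m)

n, m, sigma2 = 4, 3, 2.0  # assumed small design so C is easy to inspect
for rho in (0.5, 0.0, -0.25):
    direct = var_mean_from_cov(marginal_cov(n, m, rho, sigma2))
    closed = var_mean_closed_form(n, m, rho, sigma2)
    print(f"rho={rho:+.2f}: from C = {direct:.4f}, closed form = {closed:.4f}")
```

Note that the negative value of $\rho$ goes through without any special handling, which is the point of the marginal formulation.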

(BTW, just a quick aside to point out that the second-to-last line of the derivation above implies that we must have $\rho \ge -1/(m-1)$, or else the whole equation is negative, but variances can’t be negative! So there is a lower bound on the intra-class correlation that depends on how many measurements we have per cluster. For $m=2$ (i.e., we measure each subject twice), the intra-class correlation can go all the way down to $\rho=-1$; for $m=3$ it can only go down to $\rho=-1/2$; and so on. Fun fact!)
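Another way to see the same bound (a framing I am adding here, not part of the derivation above) is that $\textbf{R}$ has to be a valid correlation matrix, i.e., positive semi-definite. An exchangeable $m \times m$ matrix has eigenvalues $1+(m-1)\rho$ (once) and $1-\rho$ ($m-1$ times), so the smallest one hits zero exactly at $\rho = -1/(m-1)$. A rough numerical check, with hypothetical helper names and a small tolerance for floating-point error:

```python
import numpy as np

def exchangeable_corr(m, rho):
    """m x m matrix with 1 on the diagonal and rho everywhere else."""
    R = np.full((m, m), rho, dtype=float)
    np.fill_diagonal(R, 1.0)
    return R

def is_valid_corr(m, rho, tol=1e-10):
    """True if R is positive semi-definite, up to a numerical tolerance."""
    return bool(np.linalg.eigvalsh(exchangeable_corr(m, rho)).min() >= -tol)

for m in (2, 3, 5):
    bound = -1.0 / (m - 1)
    print(f"m={m}: bound={bound:+.3f}, "
          f"valid at bound: {is_valid_corr(m, bound)}, "
          f"valid just below: {is_valid_corr(m, bound - 0.01)}")
```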

So finally, once again considering the total number of observations $nm$ to be a constant, we see that the second-to-last line of the derivation above just looks like
$$\big(1+(m-1)\rho\big) \times \text{positive constant}.$$
So when $\rho>0$, having $m$ as small as possible (so that we take fewer measurements of more subjects; in the limit, 1 measurement of each subject) makes the variance of the estimate as small as possible. But when $\rho<0$, we actually want $m$ to be as large as possible (so that, in the limit, we take all $nm$ measurements from a single subject) in order to make the variance as small as possible. And when $\rho=0$, the variance of the estimate is just a constant, so our allocation of $m$ and $n$ doesn’t matter.
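The three cases can be sketched by brute force: fix the total at 500 observations, list every way of splitting it into $n$ subjects with $m$ measurements each, and pick the split with the smallest variance. The total, the $\rho$ values, and the function names are assumptions for illustration; the negative $\rho$ is chosen to respect the bound $\rho \ge -1/(m-1)$ even at $m = 500$.

```python
def var_mean(n, m, rho, sigma2=1.0):
    """Variance of the sample mean under the marginal model."""
    return sigma2 * (1 + (m - 1) * rho) / (n * m)

def best_allocation(total, rho):
    """Return the (n, m) with n * m == total that minimizes the variance."""
    designs = [(total // m, m) for m in range(1, total + 1) if total % m == 0]
    return min(designs, key=lambda d: var_mean(d[0], d[1], rho))

total = 500
for rho in (0.3, -0.001):
    n, m = best_allocation(total, rho)
    print(f"rho={rho:+.3f}: best design is n={n}, m={m}")

# With rho = 0 every allocation gives the same variance.
flat = {var_mean(total // m, m, 0.0) for m in (1, 5, 100, 500)}
print(f"rho=+0.000: distinct variances across designs = {len(flat)}")
```

For the positive $\rho$ the search lands on one measurement from each of 500 subjects, and for the negative $\rho$ it lands on 500 measurements from a single subject, matching the argument above.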