I am analysing data on 300,000 pupils in 175 schools with a logistic linear mixed effects model (random intercepts). Each pupil occurs exactly once and the data spans 6 years.
How do I partition the variance between the school and pupil levels, analogously to the VPC/ICC for continuous outcomes? I have seen this article, which proposes 4 methods, of which A and B appear interesting to me, but I would like to know the advantages/drawbacks of using either of these, and of course whether there are any other ways to do it.
How can I compare the school-level residual variance from year to year (or over any other time period)? So far I have done this by dividing the data by year and running the model against each year of data, but I think this is flawed because: i) there is no obvious reason why the data should be split by year; and ii) since the fixed effects estimates are different for each year, comparing the random effects year by year may not make sense (this is just my intuition; it would be great if someone could explain this more formally, if it is correct).
NOTE: I rewrote this question after a discussion in meta with whuber and Macro
Answer
Let y_{ij}, {\boldsymbol x}_{ij} denote the response and predictor vector (respectively) of student i in school j.
(1) For binary data, I think the standard way to do variance decompositions analogous to those done for continuous data is what the authors call Method D (I’ll comment on the other methods below) in your link – envisioning the binary data as arising from an underlying continuous variable that is governed by a linear model, and decomposing the variance on that latent scale. The reason is that logistic models (and other GLMs) naturally arise this way.
To see this, define y^{\star}_{ij} such that it is governed by a linear mixed model:
y^{\star}_{ij} = \alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j + \varepsilon_{ij}
where \alpha,\beta are regression coefficients, \eta_j \sim N(0,\sigma^2_\eta) is the school-level random effect, and \varepsilon_{ij} is the residual error term, which has a standard logistic distribution. Now let
y_{ij} =
\begin{cases} 1 & \text{if} \ \ \ y^{\star}_{ij} \geq 0\\
\\
0 &\text{if} \ \ \ y^{\star}_{ij} < 0
\end{cases}
Let p_{ij} = P(y_{ij} = 1 \mid {\boldsymbol x}_{ij},\eta_j). Now, simply using the logistic CDF we have
p_{ij} = 1 - P(y^{\star}_{ij} < 0 \mid {\boldsymbol x}_{ij},\eta_j) = \frac{ \exp \{(\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j) \} }{1+ \exp \{(\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j) \}}
now taking the logit transform of both sides, you have
\log \left( \frac{ p_{ij} }{1 - p_{ij}} \right) = \alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j
which is exactly the logistic mixed effects model. So, the logistic model is equivalent to the latent variable model specified above. One important note:
 The scale of \varepsilon_{ij} is not identified since, if you were to scale it down by a constant s, it would simply change the above to
\frac{ \exp \{(\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j)/s \} }{1+ \exp \{(\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_j)/s \}}
and therefore the coefficients and random effects would simply be scaled up by the corresponding amount. So, s=1 is used, which implies {\rm var}(\varepsilon_{ij}) = \pi^2/3.
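The latent-variable construction above can be checked by simulation: thresholding y^{\star}_{ij} at zero reproduces exactly the logistic probabilities. A minimal sketch in Python, where the parameter values (alpha, beta, x, eta) are made-up assumptions for illustration, not estimates from any real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values (assumptions, not fitted estimates)
alpha, beta = -0.5, 1.2
x = 0.3      # a single predictor value
eta = 0.6    # condition on one school's random effect

# Simulate the latent variable y* = alpha + x*beta + eta + eps,
# with eps drawn from a standard logistic distribution
n = 1_000_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)
y_star = alpha + x * beta + eta + eps
y = (y_star >= 0).astype(int)

# The logistic mixed model says P(y=1 | x, eta) = expit(alpha + x*beta + eta)
lin = alpha + x * beta + eta
p_theory = np.exp(lin) / (1 + np.exp(lin))
p_empirical = y.mean()
print(p_theory, p_empirical)  # the two should agree to ~3 decimal places
```

The empirical frequency of y = 1 matches the logistic probability, which is the equivalence the derivation above establishes.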
Now, if you use this model, then the quantity
\frac{ \hat{\sigma}^{2}_{\eta} }{\hat{\sigma}^{2}_{\eta} + \pi^2/3 }
estimates the intraclass correlation of the underlying latent variables. Another important note:
 If \varepsilon_{ij} is instead specified as having a standard normal distribution, then you have the mixed effects probit model. In that case \frac{ \hat{\sigma}^{2}_{\eta} }{\hat{\sigma}^{2}_{\eta} + 1 } estimates the tetrachoric correlation between two randomly selected pupils in the same school, which was shown by Pearson (around 1900, I think) to be statistically identified when the underlying continuous data are normally distributed (this work actually showed these correlations are identified beyond the binary case, in the multiple-category case, where they are termed polychoric correlations). For this reason, it may be preferable (and would be my recommendation) to use a probit model when the primary interest is in estimating the (tetrachoric) intraclass correlation of binary data.
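As a concrete illustration, the latent-scale ICC is just the ratio above, with the residual variance fixed by the link function. A short sketch, where the school-level variance estimate sigma2_eta is a made-up number standing in for the output of a fitted model:

```python
import math

# Hypothetical estimated school-level variance (an assumption for illustration)
sigma2_eta = 0.35

# Logistic model: residual variance on the latent scale is pi^2/3
icc_logistic = sigma2_eta / (sigma2_eta + math.pi**2 / 3)

# Probit model: residual variance on the latent scale is 1;
# this ratio estimates the tetrachoric correlation
icc_probit = sigma2_eta / (sigma2_eta + 1)

print(round(icc_logistic, 3), round(icc_probit, 3))
```

Note that the same school-level variance yields different ICCs under the two links, because the latent residual variances differ (pi^2/3 vs 1); only the latent-scale decomposition, not the number itself, is directly comparable across links.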
Regarding the other methods mentioned in the paper you linked:

(A) I've never seen the linearization method, but one drawback I can see is that there's no indication of the approximation error incurred by this. In addition, if you're going to linearize the model (through a potentially crude approximation), why not just use a linear model in the first place (e.g. option (C), which I'll get to in a minute)? It would also be more complicated to present since the ICC would depend on {\boldsymbol x}_{ij}.

(B) The simulation method is intuitively appealing to a statistician since it gives you an estimated variance decomposition on the original scale of the data, but, depending on the audience, it may (i) be complicated to describe in your "methods" section and (ii) turn off a reviewer who was looking for something "more standard".
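The idea behind the simulation method can be sketched as follows: draw school effects from the fitted random-effects distribution, convert to probabilities, and decompose the variance of y on the data scale via the law of total variance. This is a sketch of the general idea only, not the paper's exact algorithm, and the fitted values (alpha, sigma2_eta) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted values (assumptions): intercept and school-level variance
alpha, sigma2_eta = -1.0, 0.35

# Simulate many school effects from the fitted distribution and
# convert each to a success probability
eta = rng.normal(0.0, np.sqrt(sigma2_eta), size=500_000)
p = 1 / (1 + np.exp(-(alpha + eta)))     # P(y=1 | eta)

level1 = np.mean(p * (1 - p))  # within-school (Bernoulli) variance
level2 = np.var(p)             # between-school variance
vpc = level2 / (level1 + level2)
print(round(vpc, 3))

# Sanity check (law of total variance): the two components sum to the
# marginal Bernoulli variance p_bar * (1 - p_bar)
p_bar = p.mean()
print(round(level1 + level2, 4), round(p_bar * (1 - p_bar), 4))
```

Unlike the latent-scale ICC, this VPC depends on where the probabilities sit (i.e. on alpha and any covariate values), which is part of why it is harder to summarize.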

(C) Pretending the data is continuous is probably not a great idea, although it won't perform terribly if most of the probabilities are not too close to 0 or 1. But, doing this would almost certainly raise a red flag to a reviewer so I'd stay away.
Now finally,
(2) If the fixed effects are very different across years, then you're right to think that it could be difficult to compare the random effect variances across years, since they are potentially on different scales (this is related to the scaling non-identifiability issue mentioned above).
If you want to keep the fixed effects the same over time (although, if you see them changing a lot over time, you may not want to) but look at the change in the random effect variance, you can explore this using some random slopes and dummy variables. For example, if you wanted to see whether the ICCs differ across years, you could let I_k = 1 if the observation was made in year k and 0 otherwise, and then model your linear predictor as
\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_{1j} I_1 + \eta_{2j} I_2 +
\eta_{3j} I_3 + \eta_{4j} I_4 + \eta_{5j} I_5+ \eta_{6j} I_6
This will give you a different ICC each year but the same fixed effects. It may be tempting to just use a random slope in time, making your linear predictor
\alpha + {\boldsymbol x}_{ij} {\boldsymbol \beta} + \eta_{1j} + \eta_{2j} t
but I don't recommend this, since that will only allow your associations to increase over time, not decrease.
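With the dummy-variable specification, each year gets its own school-level variance estimate, and the per-year latent-scale ICCs follow directly. A sketch with made-up per-year variance estimates (the numbers are assumptions standing in for fitted values, not results from any real model):

```python
import math

# Hypothetical per-year school-level variance estimates from the
# dummy-variable random-effects model (assumptions for illustration)
sigma2_by_year = {1: 0.40, 2: 0.35, 3: 0.33, 4: 0.30, 5: 0.28, 6: 0.25}

resid_var = math.pi**2 / 3   # latent residual variance for the logistic link
icc_by_year = {k: v / (v + resid_var) for k, v in sigma2_by_year.items()}

for year, icc in sorted(icc_by_year.items()):
    print(year, round(icc, 3))
```

Because the fixed effects are shared across years in this specification, these ICCs are on a common latent scale and can be compared directly, which is exactly what the year-by-year refits could not guarantee.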
Attribution
Source : Link , Question Author : Joe King , Answer Author : Macro