# On the utility of the intercept-slope correlation in multilevel models

In their book “Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling” (1999, ch. 8, section 8.2, p. 119), Snijders & Bosker state that the intercept-slope correlation, calculated as the intercept-slope covariance divided by the square root of the product of the intercept variance and the slope variance, is not bounded between -1 and +1 and can even be infinite.

Given this, I didn’t think I should trust it, and I have an example to illustrate why. In one of my analyses, with race (a dichotomy), age, and age×race as fixed effects, cohort as the grouping variable (random intercept), and the race dichotomy as a random slope, my series of scatterplots shows that the slope varies little across the values of my cluster (i.e., cohort) variable; I don’t see the slope becoming more or less steep across cohorts. A likelihood-ratio test comparing the random-intercept and random-slope models is also non-significant, despite my total sample size (N = 22,156). And yet the intercept-slope correlation was near -0.80, which would suggest a strong convergence of the group difference in the Y variable over time, i.e., across cohorts.

I think it’s a good illustration of why I don’t trust the intercept-slope correlation, on top of what Snijders & Bosker (1999) already said.

Should we really trust and report the intercept-slope correlation in multilevel studies? Specifically, what is the usefulness of such a correlation?

EDIT 1: I don’t think it will answer my question, but gung asked me to provide more information. See below, in case it helps.

The data are from the General Social Survey. I used Stata 12; the syntax reads:

    xtmixed wordsum bw1 aged1 aged2 aged3 aged4 aged6 aged7 aged8 aged9 bw1aged1 bw1aged2 bw1aged3 bw1aged4 bw1aged6 bw1aged7 bw1aged8 bw1aged9 || cohort21: bw1, reml cov(un) var

• wordsum is a vocabulary test score (0-10),
• bw1 is the ethnic variable (black=0, white=1),
• aged1-aged9 are dummy variables of age,
• bw1aged1-bw1aged9 are the interaction between ethnicity and age,
• cohort21 is my cohort variable (21 categories, coded 0 to 20).

    . xtmixed wordsum bw1 aged1 aged2 aged3 aged4 aged6 aged7 aged8 aged9 bw1aged1 bw1aged2 bw1aged3 bw1aged4 bw1aged6 bw1aged7 bw1aged8 bw1aged9 || cohort21: bw1, reml
    > cov(un) var

    Performing EM optimization:

    Iteration 0:   log restricted-likelihood = -46809.738
    Iteration 1:   log restricted-likelihood = -46809.673
    Iteration 2:   log restricted-likelihood = -46809.673

    Computing standard errors:

    Mixed-effects REML regression                   Number of obs      =     22156
    Group variable: cohort21                        Number of groups   =        21

                                                    Obs per group: min =       307
                                                                   avg =    1055.0
                                                                   max =      1728

                                                    Wald chi2(17)      =   1563.31
    Log restricted-likelihood = -46809.673          Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
         wordsum |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             bw1 |   1.295614   .1030182    12.58   0.000     1.093702    1.497526
           aged1 |  -.7546665    .139246    -5.42   0.000    -1.027584   -.4817494
           aged2 |  -.3792977   .1315739    -2.88   0.004    -.6371779   -.1214175
           aged3 |  -.1504477   .1286839    -1.17   0.242    -.4026635     .101768
           aged4 |  -.1160748   .1339034    -0.87   0.386    -.3785207    .1463711
           aged6 |  -.1653243   .1365332    -1.21   0.226    -.4329245     .102276
           aged7 |  -.2355365    .143577    -1.64   0.101    -.5169423    .0458693
           aged8 |  -.2810572   .1575993    -1.78   0.075    -.5899461    .0278318
           aged9 |  -.6922531   .1690787    -4.09   0.000    -1.023641   -.3608649
        bw1aged1 |  -.2634496   .1506558    -1.75   0.080    -.5587297    .0318304
        bw1aged2 |  -.1059969   .1427813    -0.74   0.458    -.3858431    .1738493
        bw1aged3 |  -.1189573   .1410978    -0.84   0.399     -.395504    .1575893
        bw1aged4 |    .058361   .1457749     0.40   0.689    -.2273525    .3440746
        bw1aged6 |   .1909798   .1484818     1.29   0.198    -.1000393    .4819988
        bw1aged7 |   .2117798    .154987     1.37   0.172    -.0919891    .5155486
        bw1aged8 |   .3350124    .167292     2.00   0.045     .0071262    .6628987
        bw1aged9 |   .7307429   .1758304     4.16   0.000     .3861217    1.075364
           _cons |   5.208518   .1060306    49.12   0.000     5.000702    5.416334
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    cohort21: Unstructured       |
                        var(bw1) |   .0049087    .010795      .0000659    .3655149
                      var(_cons) |   .0480407   .0271812      .0158491     .145618
                  cov(bw1,_cons) |  -.0119882    .015875     -.0431026    .0191262
    -----------------------------+------------------------------------------------
                   var(Residual) |   3.988915   .0379483      3.915227     4.06399
    ------------------------------------------------------------------------------
    LR test vs. linear regression:       chi2(3) =    85.83   Prob > chi2 = 0.0000

    Note: LR test is conservative and provided only for reference.


The scatterplots I produced are shown below: nine plots, one for each category of my age variable.

EDIT 2:

    . estat recovariance

    Random-effects covariance matrix for level cohort21

                 |       bw1      _cons
    -------------+----------------------
             bw1 |  .0049087
           _cons | -.0119882   .0480407
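As a sanity check on the figure quoted earlier, the implied correlation can be computed directly from these REML estimates (Python here is used only as a calculator; the numbers are copied verbatim from the output above):

```python
import math

# (Co)variance estimates copied from the estat recovariance output above
var_slope = 0.0049087    # var(bw1)
var_icept = 0.0480407    # var(_cons)
cov_is = -0.0119882      # cov(bw1, _cons)

# Intercept-slope correlation: covariance divided by the square root
# of the product of the two variances
corr = cov_is / math.sqrt(var_icept * var_slope)
print(round(corr, 3))  # -0.781, i.e. the "near -0.80" value mentioned above
```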


There is another thing I want to add. What bothers me is that, with regard to the intercept-slope covariance/correlation, Joop J. Hox (2010, p. 90), in his book “Multilevel Analysis: Techniques and Applications, Second Edition”, writes:

It is easier to interpret this covariance if it is presented as a correlation between the intercept and slope residuals. … In a model without other predictors except the time variable, this correlation can be interpreted as an ordinary correlation, but in models 5 and 6 it is a partial correlation, conditional on the predictors in the model.

So it seems that not everyone agrees with Snijders & Bosker (1999, p. 119), who hold that “the idea of a correlation does not make sense here” because it is not bounded within [-1, 1].

I emailed several scholars (almost 30 people) a few weeks ago. Only a few of them replied (always in collective emails). Eugene Demidenko was the first to answer:

cov/sqrt(var1*var2) is always within [-1,1] regardless of the interpretation: it may be estimates of intercept and slope, two slopes, etc. The fact that -1<=cov/sqrt(var1*var2)<=1 follows from the Cauchy inequality and it is always true. Thus I dismiss the Snijders & Bosker statement. Maybe some other piece of information is missing?

This was followed by an email from Thomas Snijders :

The information that is missing is what was actually written about this on page 122, 123, 124, 129 of Snijders & Bosker (2nd edition 2012). This is not about two competing claims of which no more than one can be true, it is about two different interpretations.

On p. 123 a quadratic variance function is introduced,
\sigma_0^2 + 2 \sigma_{01} * x + \sigma_1^2 * x^2
and the following remark is made:
“This formula can be used without the interpretation that \sigma_0^2 and \sigma_1^2 are variances and \sigma_{01} a covariance; these parameters might be any numbers. The formula only implies that the residual variance is a quadratic function of x.”

Let me quote a full paragraph of p. 129, about a quadratic variance function at level two; note that ONE MIGHT INTERPRET that \tau_0^2 and \tau_1^2 are the level-two variances of the random intercept and random slope, and \tau_{01} is their covariance, but this is explicitly put behind the horizon:

“The parameters \tau_0^2, \tau_1^2, and \tau_{01} are, as in the preceding section, not to be interpreted themselves as variances and a corresponding covariance. The interpretation is by means of the variance function (8.7) [note t.s.: in the book this is mistakenly reported as 8.8].
Therefore it is not required that \tau_{01}^2 <= \tau_0^2 * \tau_1^2.
To put it another way, ‘correlations’ defined formally by \tau_{01}/(\tau_0 * \tau_1) may be larger than 1 or smaller than -1, even infinite, because the idea of a correlation does not make sense here.
An example of this is provided by the linear variance function for which \tau_1^2 = 0 and only the parameters \tau_0^2 and \tau_{01} are used.”

The variance function is a quadratic function of x (the variable “with the random slope”), and the variance of the outcome is this plus the level-1 variance. As long as this is positive for all x, the modelled variance is positive. (An extra requirement is that the corresponding covariance matrix is positive definite.)

Some further background of this is the existence of differences in parameter estimation algorithms in software. In some multilevel (random effects) software, the requirement is made that the covariance matrices of the random effects are positive semi-definite on all levels. In other software, the requirement is made only that the resulting estimated covariance matrix for the observed data is positive semi-definite. This implies that the idea of random coefficients of latent variables is relinquished, and the model specifies a certain covariance structure for the observed data; no more, no less; in that case the cited interpretation of Joop Hox does not apply. Note that Harvey Goldstein already long ago used linear variance functions at level one, represented by a zero slope variance and nonzero slope-intercept correlation at level one; this was and is called “complex variation”; see, e.g.,
http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/modelling-complex-variation.pdf
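The linear-variance example at the end of Snijders’ quote can be checked numerically. Here is a minimal sketch; the parameter values are made up for illustration and are not the estimates from my model:

```python
# Made-up parameters for the linear variance function Snijders describes:
# tau1^2 constrained to zero while tau01 is nonzero
tau0_sq = 1.0   # "intercept variance" parameter
tau1_sq = 0.0   # "slope variance" parameter
tau01 = 0.2     # "covariance" parameter

def level2_variance(x):
    # Quadratic variance function from the quote: tau0^2 + 2*tau01*x + tau1^2*x^2
    return tau0_sq + 2 * tau01 * x + tau1_sq * x ** 2

# A perfectly legitimate (positive) level-two variance for a 0/1 predictor ...
assert all(level2_variance(x) > 0 for x in (0, 1))

# ... while the formal "correlation" tau01 / sqrt(tau0^2 * tau1^2)
# divides by zero, i.e. is infinite, exactly as the quote says
print((tau0_sq * tau1_sq) ** 0.5)  # 0.0
```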

And then, Joop Hox replied :

In the software MLwiN it is actually possible to estimate a covariance term and at the same time constrain one of the variances to zero, which would make the “correlation” infinite. And yes, some software will allow estimates such as negative variances (SEM software usually allows this). So my statements were not completely accurate. I referred to “normal” unstructured random structures. Let me add that if you rescale the variable with the random slope to have a different zero-point, the variances and covariances generally change. So the correlation is only interpretable if the predictor variable has a fixed zero-point, i.e. is measured on a ratio scale. This applies to growth curve models, where the correlation between initial status and rate of growth is sometimes interpreted. In that case the value zero should be the ‘real’ time point where the process starts.

And he sent another mail :

Anyway, I think Tom’s explanation below fits the style of the Snijders/Bosker collaboration better than my more informal style. I would add to page 90 a footnote stating something like “Note that the parameter values in the random part are estimates. Interpreting the standardized covariances as ordinary correlations assumes that there are no constraints on the variances and that the software does not allow negative estimates. If the random part is unstructured the interpretation as ordinary (co)variances is generally tenable.”.

Note that I wrote about the correlation interpretation in the longitudinal chapter. In growth curve modeling it is very tempting to interpret this correlation as a substantive result, and that is dangerous because the value depends on the “metric of time”. If you are interested in that, I recommend Lesa Hoffman’s website (http://www.lesahoffman.com/).
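Hox’s point about the zero-point can be checked algebraically: shifting the predictor by a constant c (x → x − c) turns the intercept residual u0 into u0 + c·u1, which changes the intercept variance and the covariance but not the slope variance. A minimal sketch with made-up (co)variances, not my model’s estimates:

```python
import math

# Made-up level-two (co)variances, not the estimates above
tau0_sq, tau1_sq, tau01 = 2.0, 0.5, 0.3

def icept_slope_corr(c):
    # After recentring x at c, the intercept residual is u0 + c*u1, so:
    var_icept = tau0_sq + 2 * c * tau01 + c ** 2 * tau1_sq
    cov = tau01 + c * tau1_sq
    return cov / math.sqrt(var_icept * tau1_sq)

# The same model under three choices of zero-point: the intercept-slope
# correlation is not invariant, it drifts toward 1 as c grows
for c in (0, 1, 5):
    print(c, round(icept_slope_corr(c), 3))  # 0.3, then 0.643, then 0.947
```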

So I think that in my situation, where I have specified an unstructured covariance for the random effects, I can interpret the intercept-slope correlation as an ordinary correlation (conditional on the predictors in the model, as Hox notes).