Is there a handy plot for comparing the variance-covariance matrices of two (or perhaps more) groups? An alternative to looking at lots of marginal plots, especially in the multivariate Normal case?
An article, "Visualizing Tests for Equality of Covariance Matrices," by Michael Friendly and Matthew Sigal, has just appeared in print in The American Statistician (Volume 74, 2020, Issue 2, pp. 144-155). It suggests several graphical procedures for comparing covariance matrices.
The R package heplots supports these procedures. The illustrations in this post are modifications of those in the article, based on the supplemental code maintained at https://github.com/mattsigal/eqcov_supp/blob/master/iris-ex.R. (I have removed some distracting graphical elements.)
Let’s work through them step by step, using the well-known Iris dataset, which requires us to compare three covariance matrices of $d=4$ variables.
Here is a scatterplot of two of its four variables with symbol size and color distinguishing the three species of Iris.
As usual, the first two bivariate moments of any group can be depicted using a covariance ellipse. It is a contour of the Mahalanobis distance centered at the point of means. The software shows two such contours, presumably estimating 68% and 95% tolerance ellipses (for bivariate Normal distributions). (The contour levels are found, as usual, by referring to quantiles of a suitable chi-squared distribution.)
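As a minimal sketch of how such a contour can be computed (not the code used for the figures), the following Python traces a covariance ellipse from a mean and a $2\times 2$ covariance matrix. The function names `chi2_quantile_2df` and `covariance_ellipse` are made up for this illustration. It relies on the fact that the chi-squared distribution with 2 degrees of freedom has the closed-form quantile $-2\log(1-p)$, and it assumes the covariance matrix is positive definite.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom.

    With 2 df the CDF is 1 - exp(-x/2), so the quantile has a closed form.
    """
    return -2.0 * math.log(1.0 - p)

def covariance_ellipse(mean, cov, level=0.68, n_points=100):
    """Points on the contour {x : (x - m)' S^{-1} (x - m) = q_level}.

    mean  -- (mx, my), the point of means
    cov   -- ((sxx, sxy), (sxy, syy)), assumed positive definite
    level -- nominal coverage under bivariate Normality
    """
    (sxx, sxy), (_, syy) = cov
    # Lower-triangular Cholesky factor L of cov, so that cov = L L'.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    r = math.sqrt(chi2_quantile_2df(level))
    points = []
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        c, s = math.cos(t), math.sin(t)
        # Map the circle of radius r through L, then shift to the mean.
        points.append((mean[0] + r * l11 * c,
                       mean[1] + r * (l21 * c + l22 * s)))
    return points
```

Calling `covariance_ellipse(mean, cov, level=0.95)` returns the outer contour for the same group. Every returned point lies at squared Mahalanobis distance `chi2_quantile_2df(level)` from the mean.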
Provided the data don’t have outliers and strong nonlinearities, these provide a nice visual summary, as we can see simply by erasing the data:
The first innovation is to plot a pooled covariance ellipse. This is obtained by first recovering the sums of squares and products matrices upon multiplication of each covariance matrix by the degrees of freedom in its estimation. Those SSP matrices are then summed (componentwise, of course) and the result is divided by the total degrees of freedom. We may distinguish the pooled covariance ellipse by shading it:
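The pooling arithmetic just described can be sketched in a few lines of Python. This is an illustration only: `pooled_covariance` is a hypothetical helper, and it assumes each group's covariance matrix was estimated with $n_g - 1$ degrees of freedom (the usual case when the group mean is estimated from the same data).

```python
def pooled_covariance(covs, sizes):
    """Pool covariance matrices, weighting each by its degrees of freedom.

    covs  -- list of d x d covariance matrices (nested lists)
    sizes -- list of group sample sizes n_g; each group is assumed
             to contribute n_g - 1 degrees of freedom
    """
    d = len(covs[0])
    dfs = [n - 1 for n in sizes]
    total_df = sum(dfs)
    pooled = [[0.0] * d for _ in range(d)]
    for cov, df in zip(covs, dfs):
        for i in range(d):
            for j in range(d):
                # df * cov recovers this entry of the group's SSP matrix.
                pooled[i][j] += df * cov[i][j]
    # Divide the summed SSP matrix by the total degrees of freedom.
    return [[v / total_df for v in row] for row in pooled]
```

For the Iris data, `sizes` would be `[50, 50, 50]`, so the three covariance matrices receive equal weight and the pooled matrix is simply their average.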
The second innovation translates all ellipses to a common center:
For example, the Virginica covariance is similar to the Versicolor covariance but tends to be larger. The Setosa covariance is smaller and oriented differently, clearly distinguishing the Setosa sepal width-length relationship from that of the other two species.
(Note that because the contour level (such as 68% or 95%) merely rescales all ellipses equally, the choice of which level to use for this plot is no longer material.)
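To spell out why the level is immaterial, write $S_g$ for the covariance matrix of group $g$ and $q_p$ for the level-$p$ quantile of the chi-squared distribution with 2 degrees of freedom (notation introduced here for illustration). Once centered at the origin, the level-$p$ ellipse is

$$E_g(p) = \{x : x^\prime S_g^{-1} x = q_p\}, \qquad q_p = -2\log(1-p),$$

so that changing the level from $p$ to $p^\prime$ gives

$$E_g(p^\prime) = \sqrt{q_{p^\prime}/q_p}\; E_g(p).$$

The scale factor does not depend on $g$. Going from 68% to 95%, for instance, enlarges every ellipse by the same factor, $\sqrt{5.99/2.28}\approx 1.62$, leaving their relative sizes, shapes, and orientations unchanged.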
The final innovation emulates the scatterplot matrix: with $d \gt 2$ variables, create a $d\times d$ array doubly indexed by those variables and, in the cell for variables “X” and “Y,” draw all the covariance ellipses for those two variables, including the pooled ellipse. Distinguish the covariances graphically using the line style for the contours and/or a fill style for the polygons they bound. Choose a relatively prominent style for the pooled ellipse: here, it is the only one that is filled and it has the darkest boundary.
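The bookkeeping behind such an array is simple. As a rough sketch (the generator `bivariate_slices` is a hypothetical name, not part of any package), each cell needs only the $2\times 2$ submatrix of a covariance matrix for its pair of variables:

```python
from itertools import combinations

def bivariate_slices(cov, names):
    """Yield ((name_i, name_j), 2 x 2 sub-covariance) for each pair i < j.

    cov   -- d x d covariance matrix (nested lists)
    names -- list of the d variable names, in matrix order
    """
    for i, j in combinations(range(len(names)), 2):
        sub = [[cov[i][i], cov[i][j]],
               [cov[j][i], cov[j][j]]]
        yield (names[i], names[j]), sub
```

Applying this to each group's covariance matrix and to the pooled matrix gives the ingredients for every off-diagonal cell. With $d=4$ there are six distinct pairs, and the cells below the diagonal mirror those above it with the axes swapped.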
A pattern emerges in which the Setosa covariance matrix departs from those for the other two species and that of Virginica (still shown in red) tends to exhibit larger values overall.
Although this “bivariate slicing” approach doesn’t allow us to see everything that’s going on in these covariance matrices, the visualization is a pretty good start at making a reasoned comparison of covariance matrices. Further simplification of the graphical representation is possible (using design principles inspired by, say, Tufte or Bertin) and, I think, likely to make this approach even more effective.
When $d$ grows large (in my experience, anything greater than $8$ becomes unwieldy unless you’re willing to produce output on a high-resolution large-format printer, and even then $40$ is around the upper limit), some kind of dimension reduction technique is called for. Friendly and Sigal explore PCA solutions. Of particular interest are the applications that focus on the principal components with the smallest eigenvalues.