# Why are principal components in PCA (eigenvectors of the covariance matrix) mutually orthogonal? [duplicate]

Why are principal components in PCA mutually orthogonal? I know that PCA can be calculated by eig(cov(X)), where X is centered. But I do not see why the eigenvectors should be orthogonal.

The covariance matrix is symmetric. If a matrix $A$ is symmetric, and has two eigenvectors $u$ and $v$, consider $Au = \lambda u$ and $Av = \mu v$.

Then by symmetry (and writing $'$ for transpose):

More directly:

Since these are equal we obtain $(\lambda - \mu)u'v =0$. So either $u'v = 0$ and the two vectors are orthogonal, or $\lambda - \mu = 0$ and the two eigenvalues are equal. In the latter case, the eigenspace for that repeated eigenvalue can contain eigenvectors which are not orthogonal. So your instinct to question why the eigenvectors have to be orthogonal was a good one; if there are repeated eigenvalues they may not be! What if your sample covariance is the identity matrix? This has repeated eigenvalue $1$ and any two non-zero vectors are eigenvectors, orthogonal or not. (Thinking out such special cases is often a good way to spot counter-examples.)

If a symmetric matrix has a repeated eigenvalue, we can choose to pick out orthogonal eigenvectors from its eigenspace. That’s what we want to do in PCA, because finding orthogonal components is the whole point of the exercise. Of course it’s unlikely that your sample covariance matrix will have repeated eigenvalues – if so, it would only have taken a small perturbation of your data to make them unequal – but we should take care to define our algorithm so it really does pick out orthogonal eigenvectors. (Note that which it picks out, and in what order, is arbitrary. Think back to the identity matrix and all its possible orthogonal sets of eigenvectors! This is a generalisation of the arbitrary choice between $v$ and $-v$ for a unique eigenvalue. Output from two different implementations of PCA may look quite different.)

To see why it’s guaranteed that we can set up our implementation of eig(cov(X)) in this manner – in other words, why there are always “just enough” orthogonal vectors we can pick out from that eigenspace – you need to understand why geometric and algebraic multiplicities are equal for symmetric matrices. If the eigenvalue appears twice we can pick out two orthogonal eigenvectors; thrice and we can pick out three, and so on. Several approaches are raised in this mathematics stack exchange thread but the usual method is via the Schur decomposition. The result you are after is probably proved in your linear algebra textbook as the “spectral theorem” (though that phrase can also refer to several more general results) or perhaps under a more specific name like “symmetric eigenvalue decomposition”. Symmetric matrices have several nice properties that it’s worth knowing, e.g. their eigenvalues are real, so we can find real eigenvectors, with obvious implications for PCA.

Finally, how can we write an implementation that achieves this? I will consider two implementations of PCA in R. We can see the code for princomp: look at methods(princomp) then getAnywhere(princomp.default) and we observe edc <- eigen(cv, symmetric = TRUE). So eigen will use LAPACK routines for symmetric matrices. Checking the LAPACK Users’ Guide (3rd edition) for “symmetric eigenproblems” we see it firstly decomposes $A = QTQ'$ where $T$ is symmetric tridiagonal and $Q$ is orthogonal, then decomposes $T = S \Lambda S'$ where $\Lambda$ is diagonal and $S$ orthogonal. Then writing $Z = QS$ we have diagonalized $A = Z \Lambda Z'$. Here $\Lambda$ is the vector of eigenvalues (of $T$ and also of $A$ – they work out the same) and since $Z$ is the product of two orthogonal matrices it is also orthogonal. The computed eigenvectors are the columns of $Z$ so we can see LAPACK guarantees they will be orthonormal (if you want to know quite how the orthogonal vectors of $T$ are picked, using a Relatively Robust Representations procedure, have a look at the documentation for DSYEVR). So that’s one approach, but for numerical reasons it’d be better to do a singular value decomposition. If you look under the bonnet of another PCA function in R, you’ll see this is how prcomp works. R uses a different bunch of LAPACK routines to solve this problem.