Why does PCA maximize total variance of the projection?

Christopher Bishop writes in his book Pattern Recognition and Machine Learning a proof, that each consecutive principal component maximizes the variance of the projection to one dimension, after the data has been projected to orthogonal space to the previously selected components. Others show similar proofs.

However, this only proves that each consecutive component is the best projection to one dimension, in terms of maximizing the variance. Why does this imply, that variance of a projection to say 5 dimensions is maximized choosing first such components?

What is understood by variance in several dimensions (“total variance”) is simply a sum of variances in each dimension. Mathematically, it’s a trace of the covariance matrix: trace is simply a sum of all diagonal elements. This definition has various nice properties, e.g. trace is invariant under orthogonal linear transformations, which means that if you rotate your coordinate axes, the total variance stays the same.

What is proved in Bishop’s book (section 12.1.1), is that the leading eigenvector of covariance matrix gives the direction of maximal variance. Second eigenvector gives the direction of maximal variance under an additional constraint that it should be orthogonal to the first eigenvector, etc. (I believe this constitutes the Exercise 12.1). If the goal is to maximize the total variance in the 2D subspace, then this procedure is a greedy maximization: first choose one axis that maximizes variance, then another one.

Your question is: why does this greedy procedure obtain a global maximum?

Here is a nice argument that @whuber suggested in the comments. Let us first align the coordinate system with the PCA axes. The covariance matrix becomes diagonal: $\boldsymbol{\Sigma} = \mathrm{diag}(\lambda_i)$. For simplicity we will consider the same 2D case, i.e. what is the plane with maximal total variance? We want to prove that it is the plane given by the first two basis vectors (with total variance $\lambda_1+\lambda_2$).

Consider a plane spanned by two orthogonal vectors $\mathbf{u}$ and $\mathbf{v}$. The total variance in this plane is So it is a linear combination of eigenvalues $\lambda_i$ with coefficients that are all positive, do not exceed $1$ (see below), and sum to $2$. If so, then it is almost obvious that the maximum is reached at $\lambda_1 + \lambda_2$.

It is only left to show that the coefficients cannot exceed $1$. Notice that $u_k^2+v_k^2 = (\mathbf{u}\cdot\mathbf{k})^2+(\mathbf{v}\cdot\mathbf{k})^2$, where $\mathbf{k}$ is the $k$-th basis vector. This quantity is a squared length of a projection of $\mathbf k$ onto the plane spanned by $\mathbf u$ and $\mathbf v$. Therefore it has to be smaller than the squared length of $\mathbf k$ which is equal to $|\mathbf{k}|^2=1$, QED.

See also @cardinal’s answer to What is the objective function of PCA? (it follows the same logic).