Geometric understanding of PCA in the subject (dual) space

I am trying to get an intuitive understanding of how principal component analysis (PCA) works in subject (dual) space.

Consider a 2D dataset with two variables, $x_1$ and $x_2$, and $n$ data points (the data matrix $\mathbf X$ is $n\times 2$ and is assumed to be centered). The usual presentation of PCA is that we consider $n$ points in $\mathbb R^2$, write down the $2\times 2$ covariance matrix, and find its eigenvectors and eigenvalues; the first PC corresponds to the direction of maximum variance, etc. Here is an example with covariance matrix $\mathbf C = \left(\begin{array}{cc}4&2\\2&2\end{array}\right)$. Red lines show the eigenvectors scaled by the square roots of the respective eigenvalues.

[Figure: data points in $\mathbb R^2$ with the eigenvectors, scaled by $\sqrt{\lambda_i}$, shown as red lines]
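As a quick numerical companion to this example (not part of the original figure), the eigendecomposition of this $\mathbf C$ can be computed with NumPy; the eigenvalues come out to $3 \pm \sqrt 5$:

```python
import numpy as np

# Covariance matrix from the example above.
C = np.array([[4.0, 2.0],
              [2.0, 2.0]])

# eigh is appropriate for symmetric matrices; it returns eigenvalues ascending.
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending

# The red lines in the figure are the eigenvectors scaled by sqrt(eigenvalue).
scaled_axes = evecs * np.sqrt(evals)
print(np.round(evals, 3))                    # [5.236 0.764], i.e. 3 ± sqrt(5)
```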

Now consider what happens in the subject space (I learned this term from @ttnphns), also known as the dual space (the term used in machine learning). This is an $n$-dimensional space where the samples of our two variables (the two columns of $\mathbf X$) form two vectors $\mathbf x_1$ and $\mathbf x_2$. The squared length of each variable vector equals its variance, and the cosine of the angle between the two vectors equals the correlation between them. This representation, by the way, is very standard in treatments of multiple regression. In my example, the subject space looks like this (I show only the 2D plane spanned by the two variable vectors):

[Figure: the variable vectors $\mathbf x_1$ and $\mathbf x_2$ in the subject-space plane]
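These two facts are easy to verify numerically. The sketch below simulates a centered $\mathbf X$ (the data and seed are my own, not those of the figure) and checks both identities, using the convention $\operatorname{Var} = \lVert\mathbf x\rVert^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Simulated centered data with population covariance [[4, 2], [2, 2]].
L = np.linalg.cholesky(np.array([[4.0, 2.0], [2.0, 2.0]]))
X = rng.standard_normal((n, 2)) @ L.T
X -= X.mean(axis=0)                        # center the columns

x1, x2 = X[:, 0], X[:, 1]                  # the two variable vectors in R^n

# Squared length equals the variance under the convention Var = ||x||^2 / n
# (equivalently, divide each variable vector by sqrt(n) first).
var1 = (x1 @ x1) / n
assert np.isclose(var1, X[:, 0].var())

# Cosine of the angle between the two vectors equals their correlation.
cos_angle = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
corr = np.corrcoef(x1, x2)[0, 1]
assert np.isclose(cos_angle, corr)
```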

Principal components, being linear combinations of the two variables, will form two vectors $\mathbf p_1$ and $\mathbf p_2$ in the same plane. My question is: what is the geometrical understanding/intuition of how to form principal component variable vectors using the original variable vectors on such a plot? Given $\mathbf x_1$ and $\mathbf x_2$, what geometric procedure would yield $\mathbf p_1$?

Below is my current partial understanding of it.

First of all, I can compute principal components/axes through the standard method and plot them on the same figure:

[Figure: the principal component vectors $\mathbf p_1$ and $\mathbf p_2$ plotted alongside $\mathbf x_1$ and $\mathbf x_2$]

Moreover, we can note that $\mathbf p_1$ is chosen such that the sum of squared distances between the $\mathbf x_i$ (blue vectors) and their projections on $\mathbf p_1$ is minimal; those distances are reconstruction errors, and they are shown with black dashed lines. Equivalently, $\mathbf p_1$ maximizes the sum of the squared lengths of both projections. This fully specifies $\mathbf p_1$ and of course is completely analogous to the similar description in the primary space (see the animation in my answer to Making sense of principal component analysis, eigenvectors & eigenvalues). See also the first part of @ttnphns’s answer here.
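This maximization property can be checked numerically: in subject space the direction of $\mathbf p_1$ is the first left singular vector of $\mathbf X$, and no other unit vector in the plane of $\mathbf x_1$ and $\mathbf x_2$ yields a larger sum of squared projections. A sketch with simulated data (my own example, not the one plotted):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
X -= X.mean(axis=0)                        # center, as assumed throughout

# In subject space the PC directions are the left singular vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
u1 = U[:, 0]                               # unit vector along p1

def sum_sq_proj(u):
    """Sum of squared lengths of the projections of x1 and x2 onto u."""
    return float((u @ X) @ (u @ X))

best = sum_sq_proj(u1)                     # equals the largest squared singular value
assert np.isclose(best, s[0] ** 2)

# No other unit vector in the plane spanned by x1 and x2 does better.
for _ in range(1000):
    a, b = rng.standard_normal(2)
    v = a * X[:, 0] + b * X[:, 1]
    v /= np.linalg.norm(v)
    assert sum_sq_proj(v) <= best + 1e-9
```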

However, this is not geometric enough! It doesn’t tell me how to find such $\mathbf p_1$ and does not specify its length.

My guess is that $\mathbf x_1$, $\mathbf x_2$, $\mathbf p_1$, and $\mathbf p_2$ all lie on one ellipse centered at $0$, with $\mathbf p_1$ and $\mathbf p_2$ being its main axes. Here is how it looks in my example:

[Figure: $\mathbf x_1$, $\mathbf x_2$, $\mathbf p_1$, and $\mathbf p_2$ lying on one ellipse centered at $0$, with $\mathbf p_1$ and $\mathbf p_2$ as its main axes]

Q1: How to prove that? A direct algebraic demonstration seems to be very tedious; how can one see that this must be the case?

But there are many different ellipses centered at $0$ and passing through $\mathbf x_1$ and $\mathbf x_2$:

[Figure: several ellipses centered at $0$, all passing through $\mathbf x_1$ and $\mathbf x_2$]

Q2: What specifies the “correct” ellipse? My first guess was that it’s the ellipse with the longest possible main axis; but it seems to be wrong (there are ellipses with main axis of any length).

If there are answers to Q1 and Q2, then I would also like to know if they generalize to the case of more than two variables.

All the summaries of $\mathbf X$ displayed in the question depend only on its second moments; or, equivalently, on the matrix $\mathbf{X^\prime X}$. Because we are thinking of $\mathbf X$ as a point cloud (each point is a row of $\mathbf X$), we may ask what simple operations on these points preserve the properties of $\mathbf{X^\prime X}$.

One is to left-multiply $\mathbf X$ by an $n\times n$ matrix $\mathbf U$, which would produce another $n\times 2$ matrix $\mathbf{UX}$. For this to work, it is essential that

$$(\mathbf{UX})^\prime(\mathbf{UX}) = \mathbf{X^\prime U^\prime U X} = \mathbf{X^\prime X}.$$

Equality is guaranteed when $\mathbf{U^\prime U}$ is the $n\times n$ identity matrix: that is, when $\mathbf{U}$ is orthogonal.
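This invariance is easy to sanity-check numerically; the random orthogonal $\mathbf U$ below is built from a QR factorization (a standard construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
X = rng.standard_normal((n, 2))

# A random n x n orthogonal matrix: the Q factor of a Gaussian matrix.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
assert np.allclose(U.T @ U, np.eye(n))

# Left multiplication by an orthogonal U leaves X'X (all second moments) unchanged.
assert np.allclose((U @ X).T @ (U @ X), X.T @ X)
```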

It is well known (and easy to demonstrate) that orthogonal matrices are products of Euclidean reflections and rotations (they form a reflection group in $\mathbb{R}^n$). By choosing rotations wisely, we can dramatically simplify $\mathbf{X}$. One idea is to focus on rotations that affect only two points in the cloud at a time. These are particularly simple, because we can visualize them.

Specifically, let $(x_i, y_i)$ and $(x_j, y_j)$ be two distinct nonzero points in the cloud, constituting rows $i$ and $j$ of $\mathbf{X}$. A rotation of the column space $\mathbb{R}^n$ affecting only these two points converts them to

$$\begin{aligned}
(x_i^\prime, y_i^\prime) &= (\cos(\theta) x_i + \sin(\theta) x_j,\ \cos(\theta) y_i + \sin(\theta) y_j),\\
(x_j^\prime, y_j^\prime) &= (-\sin(\theta) x_i + \cos(\theta) x_j,\ -\sin(\theta) y_i + \cos(\theta) y_j).
\end{aligned}$$

What this amounts to is drawing the vectors $(x_i, x_j)$ and $(y_i, y_j)$ in the plane and rotating them by the angle $\theta$. (Notice how the coordinates get mixed up here! The $x$’s go with each other and the $y$’s go together. Thus, the effect of this rotation in $\mathbb{R}^n$ will not usually look like a rotation of the vectors $(x_i, y_i)$ and $(x_j, y_j)$ as drawn in $\mathbb{R}^2$.)

By choosing the angle just right, we can zero out any one of these new components. To be concrete, let’s choose $\theta$ so that

$$(\cos(\theta), \sin(\theta)) = \pm\frac{(x_i, x_j)}{\sqrt{x_i^2 + x_j^2}}.$$

This makes $x_j^\prime=0$. Choose the sign to make $y_j^\prime \ge 0$. Let’s call this operation, which changes points $i$ and $j$ in the cloud represented by $\mathbf X$, $\gamma(i,j)$.
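Here is one way $\gamma(i,j)$ might be implemented (a sketch; the function name and sign handling are mine). The final sign flip that makes $y_j^\prime \ge 0$ is a reflection rather than a rotation, but it is still orthogonal, so $\mathbf{X^\prime X}$ is preserved:

```python
import numpy as np

def gamma(X, i, j):
    """Givens rotation on rows i and j of X that zeroes X[j, 0],
    followed by a sign flip (a reflection) to make X[j, 1] >= 0."""
    X = X.copy()
    xi, xj = X[i, 0], X[j, 0]
    r = np.hypot(xi, xj)
    if r == 0:
        return X
    c, s = xi / r, xj / r                  # cos(theta), sin(theta)
    Ri = c * X[i] + s * X[j]
    Rj = -s * X[i] + c * X[j]
    X[i], X[j] = Ri, Rj
    if X[j, 1] < 0:                        # flip row j if needed
        X[j] = -X[j]
    return X

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 2))
Y = gamma(X, 0, 1)
assert np.isclose(Y[1, 0], 0.0) and Y[1, 1] >= 0
assert np.allclose(Y.T @ Y, X.T @ X)       # second moments are preserved
```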

Recursively applying $\gamma(1,2), \gamma(1,3), \ldots, \gamma(1,n)$ to $\mathbf{X}$ will cause the first column of $\mathbf{X}$ to be nonzero only in the first row. Geometrically, we will have moved all but one point in the cloud onto the $y$ axis. Now we may apply a single rotation, potentially involving coordinates $2, 3, \ldots, n$ in $\mathbb{R}^n$, to squeeze those $n-1$ points down to a single point. Equivalently, $\mathbf X$ has been reduced to a block form

$$\mathbf{X} = \pmatrix{x_1^\prime & y_1^\prime \\ \mathbf{0} & \mathbf{z}}$$

with $\mathbf{0}$ and $\mathbf{z}$ both column vectors with $n-1$ coordinates, in such a way that $\mathbf{z}$ can be rotated onto

$$(||\mathbf{z}||, 0, \ldots, 0)^\prime.$$

This final rotation further reduces $\mathbf{X}$ to its upper triangular form

$$\mathbf{X} = \pmatrix{x_1^\prime & y_1^\prime \\ 0 & ||\mathbf{z}|| \\ \vdots & \vdots \\ 0 & 0}.$$

In effect, we can now understand $\mathbf{X}$ in terms of the much simpler $2\times 2$ matrix $\pmatrix{x_1^\prime & y_1^\prime \\ 0 & ||\mathbf{z}||}$ created by the last two nonzero points left standing.
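The whole reduction can be sketched in a few lines. The helper below redoes the Givens step (without the sign convention, which does not affect the second moments) and checks that the resulting $2\times 2$ triangular form agrees, up to signs, with the $R$ factor of a QR decomposition:

```python
import numpy as np

def givens_zero(X, i, j):
    """Givens rotation on rows i and j of X that zeroes X[j, 0]."""
    X = X.copy()
    xi, xj = X[i, 0], X[j, 0]
    r = np.hypot(xi, xj)
    if r > 0:
        c, s = xi / r, xj / r
        X[i], X[j] = c * X[i] + s * X[j], -s * X[i] + c * X[j]
    return X

rng = np.random.default_rng(4)
n = 5
X = rng.standard_normal((n, 2))

# gamma(1,2), gamma(1,3), ..., gamma(1,n): zero the first column below row 1.
Y = X.copy()
for j in range(1, n):
    Y = givens_zero(Y, 0, j)
assert np.allclose(Y[1:, 0], 0)

# One more rotation of coordinates 2..n squeezes the remaining points into a
# single point of length ||z||, giving the 2x2 upper triangular form.
z = Y[1:, 1]
R = np.vstack([Y[0], [0.0, np.linalg.norm(z)]])

# Compare with the R factor of a QR decomposition (signs are conventional).
Q_np, R_np = np.linalg.qr(X)
assert np.allclose(np.abs(R), np.abs(R_np))
```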

To illustrate, I drew four iid points from a bivariate Normal distribution and rounded their values.

This initial point cloud is shown at the left of the next figure using solid black dots, with colored arrows pointing from the origin to each dot (to help us visualize them as vectors).

The sequence of operations effected on these points by $\gamma(1,2), \gamma(1,3),$ and $\gamma(1,4)$ results in the clouds shown in the middle. At the very right, the three points lying along the $y$ axis have been coalesced into a single point, leaving a representation of the reduced form of $\mathbf X$. The length of the vertical red vector is $||\mathbf{z}||$; the other (blue) vector is $(x_1^\prime, y_1^\prime)$.

Notice the faint dotted shape drawn for reference in all five panels. It represents the last remaining flexibility in representing $\mathbf X$: as we rotate the first two rows, the last two vectors trace out this ellipse. Thus, the first vector traces out the path

$$\theta\ \mapsto\ \left(\cos(\theta)\, x_1^\prime,\ \cos(\theta)\, y_1^\prime + \sin(\theta)\,||\mathbf{z}||\right) \tag{1}$$

while the second vector traces out the same path according to

$$\theta\ \mapsto\ \left(-\sin(\theta)\, x_1^\prime,\ -\sin(\theta)\, y_1^\prime + \cos(\theta)\,||\mathbf{z}||\right). \tag{2}$$

We may avoid tedious algebra by noting that because this curve is the image of the set of points $\{(\cos(\theta), \sin(\theta))\,:\, 0 \le \theta\lt 2\pi\}$ under the linear transformation determined by

$$(1,0)\ \mapsto\ (x_1^\prime, y_1^\prime), \quad (0,1)\ \mapsto\ (0, ||\mathbf{z}||),$$

it must be an ellipse. (Question 2 has now been fully answered.) Thus there will be four critical values of $\theta$ in the parameterization $(1)$, of which two correspond to the ends of the major axis and two correspond to the ends of the minor axis; and it immediately follows that simultaneously $(2)$ gives the ends of the minor axis and major axis, respectively. If we choose such a $\theta$, the corresponding points in the point cloud will be located at the ends of the principal axes, like this:

Because these are orthogonal and are directed along the axes of the ellipse, they correctly depict the principal axes: the PCA solution. That answers Question 1.
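Numerically, one can recover this picture from the reduced $2\times 2$ matrix alone (the entries below are made up for illustration): scan $\theta$, take the rotation that maximizes the length of the first row, and check that the two rotated rows are then orthogonal with lengths equal to the singular values:

```python
import numpy as np

# A made-up reduced 2x2 matrix: its rows are the two remaining nonzero
# points (x1', y1') and (0, ||z||).
M = np.array([[1.3, 0.4],
              [0.0, 0.9]])

# Rotate both rows by theta and track the length of the first row; the
# principal axes of the ellipse sit at the critical values of this length.
thetas = np.linspace(0.0, 2 * np.pi, 200001)
rows1 = np.outer(np.cos(thetas), M[0]) + np.outer(np.sin(thetas), M[1])
lengths = np.linalg.norm(rows1, axis=1)
t_star = thetas[np.argmax(lengths)]

c, s = np.cos(t_star), np.sin(t_star)
P = np.array([[c, s], [-s, c]]) @ M        # rows at the ends of the principal axes

# The rotated rows are orthogonal, and their lengths are the singular values.
assert abs(P[0] @ P[1]) < 1e-3
axis_lengths = np.sort(np.linalg.norm(P, axis=1))[::-1]
assert np.allclose(axis_lengths, np.linalg.svd(M, compute_uv=False), atol=1e-3)
```

The grid search stands in for the calculus: maximizing the first row's length is exactly the condition that makes the two rows orthogonal, since the derivative of the squared length is twice their dot product.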

The analysis given here complements that of my answer at Bottom to top explanation of the Mahalanobis distance. There, by examining rotations and rescalings in $\mathbb{R}^2$, I explained how any point cloud in $p=2$ dimensions geometrically determines a natural coordinate system for $\mathbb{R}^2$. Here, I have shown how it geometrically determines an ellipse which is the image of a circle under a linear transformation. This ellipse is, of course, an isocontour of constant Mahalanobis distance.

Another thing accomplished by this analysis is to display an intimate connection between the QR decomposition (of a rectangular matrix) and the Singular Value Decomposition, or SVD. The $\gamma(i,j)$ are known as Givens rotations. Their composition constitutes the orthogonal, or “$Q$”, part of the QR decomposition. What remained (the reduced form of $\mathbf{X}$) is the upper triangular, or “$R$”, part of the QR decomposition. At the same time, the rotations and rescalings (described as relabelings of the coordinates in the other post) constitute the $\mathbf{D}\cdot \mathbf{V}^\prime$ part of the SVD, $\mathbf{X} = \mathbf{U\, D\, V^\prime}$. The rows of $\mathbf{U}$, incidentally, form the point cloud displayed in the last figure of that post.
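The correspondence can be spot-checked with NumPy’s built-in factorizations: both the triangular $R$ from QR and $\mathbf{D V^\prime}$ from the SVD carry the same second-moment information $\mathbf{X^\prime X}$:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((4, 2))

# QR: the composed Givens rotations form Q; what is left standing is R.
Q, R = np.linalg.qr(X)
assert np.allclose(Q @ R, X)

# SVD: X = U D V'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(U @ np.diag(d) @ Vt, X)

# Both R and D V' reproduce X'X, so they describe the same point
# configuration up to an orthogonal transformation.
assert np.allclose(R.T @ R, X.T @ X)
DV = np.diag(d) @ Vt
assert np.allclose(DV.T @ DV, X.T @ X)
```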

Finally, the analysis presented here generalizes in obvious ways to the cases $p\ne 2$: that is, when there is just one principal component or there are more than two.