Geometric understanding of PCA in the subject (dual) space

I am trying to get an intuitive understanding of how principal component analysis (PCA) works in subject (dual) space.

Consider a 2D dataset with two variables, $x_1$ and $x_2$, and $n$ data points (the data matrix $X$ is $n\times 2$ and is assumed to be centered). The usual presentation of PCA is that we consider $n$ points in $\mathbb{R}^2$, write down the $2\times 2$ covariance matrix, and find its eigenvectors and eigenvalues; the first PC corresponds to the direction of maximum variance, etc. Here is an example with covariance matrix $C=\begin{pmatrix}4 & 2\\ 2 & 2\end{pmatrix}$. Red lines show eigenvectors scaled by the square roots of the respective eigenvalues.

PCA in sample space
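For concreteness, the eigen-analysis of this example can be reproduced in a few lines. This is a sketch using NumPy; `eigh` is appropriate because $C$ is symmetric, and it returns eigenvalues in ascending order.

```python
import numpy as np

# Covariance matrix from the example above
C = np.array([[4.0, 2.0],
              [2.0, 2.0]])

# eigh is suited to symmetric matrices; eigenvalues come back in
# ascending order.  Here they are 3 - sqrt(5) and 3 + sqrt(5).
eigvals, eigvecs = np.linalg.eigh(C)

# First principal axis: eigenvector of the largest eigenvalue
pc1 = eigvecs[:, -1]

# The red lines of the figure: eigenvectors scaled by sqrt(eigenvalue)
scaled_axes = eigvecs * np.sqrt(eigvals)
print(eigvals, pc1, scaled_axes)
```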

Now consider what happens in the subject space (I learned this term from @ttnphns), also known as dual space (the term used in machine learning). This is an $n$-dimensional space where the samples of our two variables (the two columns of $X$) form two vectors $x_1$ and $x_2$. The squared length of each variable vector is equal to its variance, and the cosine of the angle between the two vectors is equal to the correlation between them. This representation, by the way, is very standard in treatments of multiple regression. In my example, the subject space looks like this (I only show the 2D plane spanned by the two variable vectors):

PCA in subject space 1
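These two geometric facts are easy to verify numerically. Below is a sketch with simulated, centered data (the data and the scaling by $1/n$ are my own conventions, not from the question): for centered columns the cosine of the angle between the variable vectors equals the Pearson correlation exactly, and the squared length divided by $n$ recovers the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Hypothetical data matrix with two columns, centered
X = rng.standard_normal((n, 2))
X -= X.mean(axis=0)
x1, x2 = X[:, 0], X[:, 1]

# Cosine of the angle between the two variable vectors
cos_angle = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
# Pearson correlation of the two variables
corr = np.corrcoef(x1, x2)[0, 1]

# Squared length, scaled by n, recovers the variance
var1 = x1 @ x1 / n
print(cos_angle, corr, var1, x1.var())
```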

Principal components, being linear combinations of the two variables, will form two vectors $p_1$ and $p_2$ in the same plane. My question is: what is the geometrical understanding/intuition of how to form the principal component variable vectors using the original variable vectors on such a plot? Given $x_1$ and $x_2$, what geometric procedure would yield $p_1$?


Below is my current partial understanding of it.

First of all, I can compute principal components/axes through the standard method and plot them on the same figure:

PCA in subject space 2

Moreover, we can note that $p_1$ is chosen such that the sum of squared distances between the $x_i$ (blue vectors) and their projections on $p_1$ is minimal; those distances are reconstruction errors and they are shown with black dashed lines. Equivalently, $p_1$ maximizes the sum of the squared lengths of both projections. This fully specifies $p_1$ and of course is completely analogous to the similar description in the primary space (see the animation in my answer to Making sense of principal component analysis, eigenvectors & eigenvalues). See also the first part of @ttnphns's answer here.

However, this is not geometric enough! It does not tell me how to find such a $p_1$ and does not specify its length.

My guess is that $x_1$, $x_2$, $p_1$, and $p_2$ all lie on one ellipse centered at $0$, with $p_1$ and $p_2$ being its main axes. Here is how it looks in my example:

[Figure: the ellipse through $x_1$, $x_2$, $p_1$, and $p_2$]

Q1: How to prove that? Direct algebraic demonstration seems to be very tedious; how to see that this must be the case?

But there are many different ellipses centered at $0$ and passing through $x_1$ and $x_2$:

[Figure: several ellipses centered at $0$ passing through $x_1$ and $x_2$]

Q2: What specifies the “correct” ellipse? My first guess was that it’s the ellipse with the longest possible main axis; but it seems to be wrong (there are ellipses with main axis of any length).

If there are answers to Q1 and Q2, then I would also like to know if they generalize to the case of more than two variables.

Answer

All the summaries of $X$ displayed in the question depend only on its second moments; or, equivalently, on the matrix $X'X$. Because we are thinking of $X$ as a point cloud (each point is a row of $X$), we may ask what simple operations on these points preserve the properties of $X'X$.

One is to left-multiply $X$ by an $n\times n$ matrix $U$, which would produce another $n\times 2$ matrix $UX$. For this to work, it is essential that

$$X'X = (UX)'(UX) = X'(U'U)X.$$

Equality is guaranteed when $U'U$ is the $n\times n$ identity matrix: that is, when $U$ is orthogonal.
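A quick numerical check of this invariance (a sketch; the particular matrices are arbitrary, and the orthogonal $U$ is built from a QR factorization of a random square matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
X = rng.standard_normal((n, 2))          # an arbitrary n x 2 point cloud

# A random orthogonal U from the QR factorization of a random matrix
U, _ = np.linalg.qr(rng.standard_normal((n, n)))

# (UX)'(UX) = X'(U'U)X = X'X because U'U is the identity
lhs = (U @ X).T @ (U @ X)
rhs = X.T @ X
print(np.allclose(lhs, rhs))
```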

It is well known (and easy to demonstrate) that orthogonal matrices are products of Euclidean reflections and rotations (they form a reflection group in $\mathbb{R}^n$). By choosing rotations wisely, we can dramatically simplify $X$. One idea is to focus on rotations that affect only two points in the cloud at a time. These are particularly simple, because we can visualize them.

Specifically, let $(x_i, y_i)$ and $(x_j, y_j)$ be two distinct nonzero points in the cloud, constituting rows $i$ and $j$ of $X$. A rotation of the column space $\mathbb{R}^n$ affecting only these two points converts them to

$$\begin{cases}
(x_i', y_i') = (\cos(\theta)\,x_i + \sin(\theta)\,x_j,\ \cos(\theta)\,y_i + \sin(\theta)\,y_j) \\
(x_j', y_j') = (-\sin(\theta)\,x_i + \cos(\theta)\,x_j,\ -\sin(\theta)\,y_i + \cos(\theta)\,y_j).
\end{cases}$$

What this amounts to is drawing the vectors $(x_i, x_j)$ and $(y_i, y_j)$ in the plane and rotating them by the angle $\theta$. (Notice how the coordinates get mixed up here! The $x$'s go with each other and the $y$'s go together. Thus, the effect of this rotation in $\mathbb{R}^n$ will not usually look like a rotation of the vectors $(x_i, y_i)$ and $(x_j, y_j)$ as drawn in $\mathbb{R}^2$.)

By choosing the angle just right, we can zero out any one of these new components. To be concrete, let's choose $\theta$ so that

$$\begin{cases}
\cos(\theta) = \pm \dfrac{x_i}{\sqrt{x_i^2 + x_j^2}} \\[1ex]
\sin(\theta) = \pm \dfrac{x_j}{\sqrt{x_i^2 + x_j^2}}.
\end{cases}$$

This makes $x_j' = 0$. Choose the sign to make $y_j' \ge 0$. Let's call this operation, which changes points $i$ and $j$ in the cloud represented by $X$, $\gamma(i, j)$.
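As a sketch, $\gamma(i,j)$ might be implemented as follows (the function name and the way the sign is chosen are my own; only rows $i$ and $j$ change, and $X'X$ is preserved because the operation is a rotation):

```python
import numpy as np

def gamma(X, i, j):
    """Rotate rows i and j of X so that X[j, 0] becomes 0,
    choosing the sign of the angle so that X[j, 1] >= 0."""
    X = X.copy()
    xi, xj = X[i, 0], X[j, 0]
    r = np.hypot(xi, xj)
    if r == 0:
        return X                          # nothing to zero out
    c, s = xi / r, xj / r
    if -s * X[i, 1] + c * X[j, 1] < 0:    # pick the sign making y'_j >= 0
        c, s = -c, -s
    rot = np.array([[c, s], [-s, c]])
    X[[i, j], :] = rot @ X[[i, j], :]
    return X

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 2))
Y = gamma(X, 0, 1)
print(np.round(Y, 3))   # Y[1, 0] is 0 and X'X is unchanged
```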

Recursively applying $\gamma(1,2), \gamma(1,3), \ldots, \gamma(1,n)$ to $X$ will cause the first column of $X$ to be nonzero only in the first row. Geometrically, we will have moved all but one point in the cloud onto the $y$ axis. Now we may apply a single rotation, potentially involving coordinates $2, 3, \ldots, n$ in $\mathbb{R}^n$, to squeeze those $n-1$ points down to a single point. Equivalently, $X$ has been reduced to a block form

$$X = \begin{pmatrix} x_1 & y_1 \\ \mathbf{0} & \mathbf{z} \end{pmatrix},$$

with $\mathbf{0}$ and $\mathbf{z}$ both column vectors with $n-1$ coordinates, in such a way that

$$X'X = \begin{pmatrix} x_1^2 & x_1 y_1 \\ x_1 y_1 & y_1^2 + \|\mathbf{z}\|^2 \end{pmatrix}.$$

This final rotation further reduces $X$ to its upper triangular form

$$X = \begin{pmatrix} x_1 & y_1 \\ 0 & \|\mathbf{z}\| \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \end{pmatrix}.$$

In effect, we can now understand $X$ in terms of the much simpler $2\times 2$ matrix $\begin{pmatrix} x_1 & y_1 \\ 0 & \|\mathbf{z}\| \end{pmatrix}$ created by the last two nonzero points left standing.

To illustrate, I drew four iid points from a bivariate Normal distribution and rounded their values to

$$X = \begin{pmatrix} 0.09 & 0.12 \\ 0.31 & 0.63 \\ 0.74 & 0.23 \\ 1.8 & 0.39 \end{pmatrix}.$$

This initial point cloud is shown at the left of the next figure using solid black dots, with colored arrows pointing from the origin to each dot (to help us visualize them as vectors).

Figure

The sequence of operations effected on these points by $\gamma(1,2)$, $\gamma(1,3)$, and $\gamma(1,4)$ results in the clouds shown in the middle. At the very right, the three points lying along the $y$ axis have been coalesced into a single point, leaving a representation of the reduced form of $X$. The length of the vertical red vector is $\|\mathbf{z}\|$; the other (blue) vector is $(x_1, y_1)$.
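The whole reduction can be replayed in code. This is a sketch using the example matrix as printed above; the sign convention for $y_j'$ is skipped here, since it does not affect $X'X$, and the final coalescing rotation is summarized by taking the norm of the remaining second-column entries.

```python
import numpy as np

# The example matrix (values as printed above)
X = np.array([[0.09, 0.12],
              [0.31, 0.63],
              [0.74, 0.23],
              [1.8 , 0.39]])

A = X.copy()
# gamma(1, j): rotate rows 0 and j to zero out A[j, 0]
for j in range(1, 4):
    xi, xj = A[0, 0], A[j, 0]
    r = np.hypot(xi, xj)
    c, s = xi / r, xj / r
    A[[0, j], :] = np.array([[c, s], [-s, c]]) @ A[[0, j], :]

# One last rotation of rows 1..3 coalesces them into a single point
z = np.linalg.norm(A[1:, 1])
R = np.array([[A[0, 0], A[0, 1]], [0.0, z]])

# The 2x2 reduced form carries the same second moments as X
print(R)
print(np.allclose(R.T @ R, X.T @ X))
```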

Notice the faint dotted shape drawn for reference in all five panels. It represents the last remaining flexibility in representing $X$: as we rotate the first two rows, the last two vectors trace out this ellipse. Thus, the first vector traces out the path

$$\theta\ \mapsto\ \big(\cos(\theta)\,x_1,\ \cos(\theta)\,y_1 + \sin(\theta)\,\|\mathbf{z}\|\big) \tag{1}$$

while the second vector traces out the same path according to

$$\theta\ \mapsto\ \big(-\sin(\theta)\,x_1,\ -\sin(\theta)\,y_1 + \cos(\theta)\,\|\mathbf{z}\|\big). \tag{2}$$

We may avoid tedious algebra by noting that because this curve is the image of the set of points $\{(\cos(\theta), \sin(\theta)) : 0 \le \theta < 2\pi\}$ under the linear transformation determined by

$$(1, 0)\ \mapsto\ (x_1, 0);\quad (0, 1)\ \mapsto\ (y_1, \|\mathbf{z}\|),$$

it must be an ellipse. (Question 2 has now been fully answered.) Thus there will be four critical values of $\theta$ in the parameterization $(1)$, of which two correspond to the ends of the major axis and two correspond to the ends of the minor axis; and it immediately follows that $(2)$ simultaneously gives the ends of the minor axis and major axis, respectively. If we choose such a $\theta$, the corresponding points in the point cloud will be located at the ends of the principal axes, like this:

Figure 2

Because these are orthogonal and are directed along the axes of the ellipse, they correctly depict the principal axes: the PCA solution. That answers Question 1.
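This can be checked numerically: the longest and shortest vectors on the ellipse have lengths equal to the singular values of $X$. The sketch below uses the example matrix as printed above, with `numpy.linalg.qr` standing in for the Givens reduction, and locates the extremes by a brute-force scan over $\theta$.

```python
import numpy as np

X = np.array([[0.09, 0.12],
              [0.31, 0.63],
              [0.74, 0.23],
              [1.8 , 0.39]])

# Reduced 2x2 form: R satisfies R'R = X'X.  Its transpose maps
# (1,0) -> (x_1, 0) and (0,1) -> (y_1, ||z||), i.e. the unit circle
# onto the dotted ellipse.
R = np.linalg.qr(X, mode='r')
M = R.T

# Scan theta; the longest and shortest images are the semi-axes
theta = np.linspace(0.0, np.pi, 20001)
images = M @ np.vstack([np.cos(theta), np.sin(theta)])
lengths = np.linalg.norm(images, axis=0)
semi_axes = np.array([lengths.max(), lengths.min()])

sv = np.linalg.svd(X, compute_uv=False)
print(semi_axes, sv)   # the semi-axes match the singular values
```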


The analysis given here complements that of my answer at Bottom to top explanation of the Mahalanobis distance. There, by examining rotations and rescalings in $\mathbb{R}^2$, I explained how any point cloud in $p=2$ dimensions geometrically determines a natural coordinate system for $\mathbb{R}^2$. Here, I have shown how it geometrically determines an ellipse which is the image of a circle under a linear transformation. This ellipse is, of course, an isocontour of constant Mahalanobis distance.

Another thing accomplished by this analysis is to display an intimate connection between the QR decomposition (of a rectangular matrix) and the Singular Value Decomposition, or SVD. The $\gamma(i,j)$ are known as Givens rotations. Their composition constitutes the orthogonal, or "$Q$", part of the QR decomposition. What remained (the reduced form of $X$) is the upper triangular, or "$R$", part of the QR decomposition. At the same time, the rotation and rescalings (described as relabelings of the coordinates in the other post) constitute the $DV'$ part of the SVD, $X = UDV'$. The rows of $U$, incidentally, form the point cloud displayed in the last figure of that post.
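Both decompositions are available directly in NumPy, which makes the correspondence easy to inspect (a sketch on the example matrix; note that $R'R = (DV')'(DV') = X'X$, so the two triangular and diagonal factors encode the same second moments):

```python
import numpy as np

X = np.array([[0.09, 0.12],
              [0.31, 0.63],
              [0.74, 0.23],
              [1.8 , 0.39]])

# Q: the composition of the Givens rotations; R: the reduced form
Q, R = np.linalg.qr(X)

# SVD of the same matrix
U, D, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(Q @ R, X))          # QR reproduces X
print(np.allclose((U * D) @ Vt, X))   # so does U D V'
print(np.allclose(R.T @ R, X.T @ X))  # R carries the second moments
```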

Finally, the analysis presented here generalizes in obvious ways to the cases $p \ne 2$: that is, when there is just one principal component or more than two.

Attribution
Source: Link, Question Author: amoeba, Answer Author: Community
