I am trying to understand some descriptions of PCA (the first two are from Wikipedia), emphasis added:
Principal components are guaranteed to be independent only if the data set is jointly normally distributed.
Is the independence of principal components very important? How can I understand this description?
PCA is sensitive to the relative scaling of the original variables.
What does ‘scaling’ mean there? Normalization of different dimensions?
The transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance under the constraint that it be orthogonal to the preceding components.
Can you explain this constraint?
Q1. Principal components are mutually orthogonal (uncorrelated) variables. Orthogonality and statistical independence are not synonyms: independent variables are uncorrelated, but uncorrelated variables need not be independent. Principal components are not special in this respect; the same is true of any variables in multivariate data analysis. If the data are multivariate normal (which is not the same as saying that each variable is univariately normal) and the variables are uncorrelated, then yes, they are independent. Whether independence of principal components matters depends on how you are going to use them. Quite often, their orthogonality will suffice.
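To make the difference between "uncorrelated" and "independent" concrete, here is a minimal sketch with NumPy. The toy data (a standard normal variable and its square) are my own assumption, chosen so that the data are clearly not jointly normal; PCA is done here by eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
# Toy data (assumed for illustration): the second variable is a
# deterministic function of the first, so the two are strongly dependent.
data = np.column_stack([x, x**2])

# PCA via eigendecomposition of the covariance matrix of the centered data
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
scores = centered @ eigvecs[:, order]      # principal component scores

# Linear correlation between the two components
linear_corr = np.corrcoef(scores, rowvar=False)[0, 1]
# Correlation between component 1 and the square of component 2
nonlinear_corr = np.corrcoef(scores[:, 0], scores[:, 1] ** 2)[0, 1]

print(linear_corr, nonlinear_corr)
```

The linear correlation between the two components is zero up to rounding error, yet component 1 is almost perfectly predictable from the square of component 2, so the components are uncorrelated without being independent.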
Q2. Yes, scaling means shrinking or stretching the variance of individual variables. The variables are the dimensions of the space the data lie in. PCA results (the components) are sensitive to the shape of the data cloud, the shape of that “ellipsoid”. If you only center the variables and leave the variances as they are, this is often called “PCA based on covariances”. If you also standardize the variables to variances = 1, this is often called “PCA based on correlations”, and it can be very different from the former (see a thread). Less commonly, people do PCA on non-centered data: raw data, or data merely scaled to unit magnitude; the results of such PCA differ further from those obtained with centered data (see a picture).
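A small sketch of how much the choice between covariances and correlations can matter. The helper `first_axis` and the two-variable toy data are hypothetical, made up for this illustration; one variable is deliberately given a variance thousands of times larger than the other.

```python
import numpy as np

def first_axis(data, standardize):
    """Return the first principal axis (a unit vector) of `data`.

    standardize=False -> PCA based on covariances (centering only)
    standardize=True  -> PCA based on correlations (centering + unit variances)
    """
    prepared = data - data.mean(axis=0)
    if standardize:
        prepared = prepared / prepared.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(prepared, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(1)
z1 = rng.standard_normal(1000)
z2 = rng.standard_normal(1000)
# Variable 1 has a far larger variance than variable 2; they are correlated.
data = np.column_stack([100.0 * z1, 0.5 * z1 + z2])

cov_axis = first_axis(data, standardize=False)
corr_axis = first_axis(data, standardize=True)

print(np.abs(cov_axis))    # absolute loadings, covariance-based PCA
print(np.abs(corr_axis))   # absolute loadings, correlation-based PCA
```

With covariances, the first component lies almost exactly along the high-variance variable. With correlations, both variables get equal weight; for two correlated variables that is always the case, since the first eigenvector of a 2×2 correlation matrix is proportional to (1, 1) or (1, −1).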
Q3. The “constraint” is how PCA works (see a huge thread). Imagine your data as a 3-dimensional cloud (3 variables, n points) with the origin set at its centroid (the mean). PCA draws component 1 as the axis through the origin on which the sum of the squared projections (coordinates) is maximized; that is, the variance along component 1 is maximized. Once component 1 is defined, it can be removed as a dimension, which means that the data points are projected onto the plane orthogonal to that component. You are left with a 2-dimensional cloud. Then you apply the same procedure of finding the axis of maximal variance again, now in this remnant 2D cloud, and that gives component 2. You remove component 2 from the plane by projecting the data points onto the line orthogonal to it. That line, representing the remnant 1D cloud, is defined as the last component, component 3. You can see that at each of these 3 “steps” the analysis a) found the dimension of greatest variance in the current p-dimensional space, and b) reduced the data to the space without that dimension, that is, to the (p−1)-dimensional space orthogonal to it. That is how it turns out that each principal component is a “maximal variance” and all the components are mutually orthogonal (see also).
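The step-by-step description above can be sketched in code. This is only an illustration under my own assumptions: the function name `pca_by_deflation` and the toy data are hypothetical, the data are assumed to be of full rank, and the axis of greatest variance at each step is found with an eigendecomposition of the current covariance matrix (real PCA routines normally extract all components at once).

```python
import numpy as np

def pca_by_deflation(data):
    """Extract principal axes one at a time, as in the verbal description."""
    residual = data - data.mean(axis=0)      # origin at the centroid
    axes = []
    for _ in range(residual.shape[1]):
        # a) axis of greatest variance in what is left of the cloud
        eigvals, eigvecs = np.linalg.eigh(np.cov(residual, rowvar=False))
        axis = eigvecs[:, np.argmax(eigvals)]
        axes.append(axis)
        # b) remove that dimension: project the points onto the subspace
        #    orthogonal to the axis just found
        residual = residual - np.outer(residual @ axis, axis)
    return np.column_stack(axes)             # columns = component 1, 2, ...

rng = np.random.default_rng(2)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0],
                   [0.5, 0.5, 1.0]])
data = rng.standard_normal((500, 3)) @ mixing   # toy 3-variable cloud

stepwise_axes = pca_by_deflation(data)

# Reference: all components at once from the covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
direct_axes = eigvecs[:, np.argsort(eigvals)[::-1]]

# Absolute cosines between the stepwise and the direct axes
agreement = np.abs(stepwise_axes.T @ direct_axes)
print(np.round(agreement, 6))
```

The matrix of absolute cosines comes out as the identity: the axes found one by one coincide, up to sign, with the eigenvectors of the covariance matrix, and they are mutually orthogonal because each one is sought only in the space orthogonal to its predecessors.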
[P.S. Please note that “orthogonal” means two things: (1) variable axes being physically perpendicular; (2) variables being uncorrelated in their data. With PCA and some other multivariate methods, these two things coincide. But with some other analyses (e.g. Discriminant analysis), the fact that the extracted latent variables are uncorrelated does not automatically mean that their axes are perpendicular in the original space.]
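For PCA, both senses of “orthogonal” can be checked numerically. A minimal sketch, assuming toy data of my own and PCA computed through the SVD of the centered data matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.4]])
data = rng.standard_normal((300, 3)) @ mixing   # toy data, assumed
centered = data - data.mean(axis=0)

# Rows of vt are the component axes expressed in the original variable space
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T                        # component scores

axis_gram = vt @ vt.T                           # sense (1): perpendicular axes
score_corr = np.corrcoef(scores, rowvar=False)  # sense (2): uncorrelated scores

print(np.round(axis_gram, 6))
print(np.round(score_corr, 6))
```

Both matrices are the identity up to rounding: the axes are perpendicular unit vectors and the scores on them are uncorrelated.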