Are PCA components (in principal component analysis) statistically independent if our data is multivariate normally distributed? If so, how can this be demonstrated/proven?

I ask because I saw this post, where the top answer states:

PCA does not make an explicit Gaussianity assumption. It finds the eigenvectors that maximize the variance explained in the data. The orthogonality of the principal components means that it finds the most uncorrelated components to explain as much variation in the data as possible. For multivariate gaussian distributions, zero correlation between components implies independence which is not true for most distributions.

The answer is stated without a proof, and seems to imply that PCA produces independent components if the data is multivariate normal.

Specifically, say our data are samples from:

$$\mathbf{x} \sim \mathcal N(\mathbf{\mu}, \mathbf{\Sigma})$$

we put $n$ samples of $\mathbf{x}$ into rows of our matrix of samples $\mathbf{X}$, so $\mathbf{X}$ is $n \times m$. Computing the SVD of $\mathbf{X}$ (after centering) yields

$$\mathbf{X} = \mathbf{USV}^{T}$$

Can we say that the columns of $\mathbf{U}$ are statistically independent, also then the rows of $\mathbf{V}^T$? Is this true in general, just for $\mathbf{x} \sim \mathcal N(\mathbf{\mu}, \mathbf{\Sigma})$, or not true at all?

**Answer**

**I will start with an intuitive demonstration.**

I generated $n=100$ observations (a) from a strongly non-Gaussian 2D distribution, and (b) from a 2D Gaussian distribution. In both cases I centered the data and performed the singular value decomposition $\mathbf X=\mathbf{USV}^\top$. Then for each case I made a scatter plot of the first two columns of $\mathbf U$, one against another. Note that it is usually columns of $\mathbf{US}$ that are called “principal components” (PCs); columns of $\mathbf U$ are PCs scaled to have unit norm; still, in this answer I am focusing on columns of $\mathbf U$. Here are the scatter-plots:

I think that statements such as “PCA components are uncorrelated” or “PCA components are dependent/independent” are usually made about one specific sample matrix $\mathbf X$ and refer to the correlations/dependencies *across rows* (see e.g. @ttnphns’s answer here). PCA yields a transformed data matrix $\mathbf U$, where rows are observations and columns are PC variables. I.e. we can see $\mathbf U$ as *a sample*, and ask what is the sample correlation between PC variables. This sample correlation matrix is of course given by $\mathbf U^\top \mathbf U=\mathbf I$, meaning that the sample correlations between PC variables are zero. This is what people mean when they say that “PCA diagonalizes the covariance matrix”, etc.

**Conclusion 1: in PCA coordinates, any data have zero correlation.**

This is true for the both scatterplots above. However, it is immediately obvious that the two PC variables $x$ and $y$ on the left (non-Gaussian) scatterplot are not independent; even though they have zero correlation, they are strongly dependent and in fact related by a $y\approx a(x-b)^2$. And indeed, it is well-known that uncorrelated does not mean independent.

On the contrary, the two PC variables $x$ and $y$ on the right (Gaussian) scatterplot seem to be “pretty much independent”. Computing mutual information between them (which is a measure of statistical dependence: independent variables have zero mutual information) by any standard algorithm will yield a value very close to zero. It will not be exactly zero, because it is never exactly zero for any finite sample size (unless fine-tuned); moreover, there are various methods to compute mutual information of two samples, giving slightly different answers. But we can expect that any method will yield an estimate of mutual information that is very close to zero.

**Conclusion 2: in PCA coordinates, Gaussian data are “pretty much independent”, meaning that standard estimates of dependency will be around zero.**

The question, however, is more tricky, as shown by the long chain of comments. Indeed, @whuber rightly points out that PCA variables $x$ and $y$ (columns of $\mathbf U$) *must* be statistically dependent: the columns have to be of unit length and have to be orthogonal, and this introduces a dependency. E.g. if some value in the first column is equal to $1$, then the corresponding value in the second column must be $0$.

This is true, but is only practically relevant for very small $n$, such as e.g. $n=3$ (with $n=2$ after centering there is only one PC). For any reasonable sample size, such as $n=100$ shown on my figure above, the effect of the dependency will be negligible; columns of $\mathbf U$ are (scaled) projections of Gaussian data, so they are also Gaussian, which makes it practically impossible for one value to be close to $1$ (this would require all other $n-1$ elements to be close to $0$, which is hardly a Gaussian distribution).

**Conclusion 3: strictly speaking, for any finite $n$, Gaussian data in PCA coordinates are dependent; however, this dependency is practically irrelevant for any $n\gg 1$.**

We can make this precise by considering what happens in the limit of $n \to \infty$. In the limit of infinite sample size, the sample covariance matrix is equal to the population covariance matrix $\mathbf \Sigma$. So if the data vector $X$ is sampled from $\vec X \sim \mathcal N(0,\boldsymbol \Sigma)$, then the PC variables are $\vec Y = \Lambda^{-1/2}V^\top \vec X/(n-1)$ (where $\Lambda$ and $V$ are eigenvalues and eigenvectors of $\boldsymbol \Sigma$) and $\vec Y \sim \mathcal N(0, \mathbf I/(n-1))$. I.e. PC variables come from a multivariate Gaussian with diagonal covariance. But any multivariate Gaussian with diagonal covariance matrix decomposes into a product of univariate Gaussians, and this is the definition of statistical independence:

\begin{align}

\mathcal N(\mathbf 0,\mathrm{diag}(\sigma^2_i)) &= \frac{1}{(2\pi)^{k/2} \det(\mathrm{diag}(\sigma^2_i))^{1/2}} \exp\left[-\mathbf x^\top \mathrm{diag}(\sigma^2_i) \mathbf x/2\right]\\&=\frac{1}{(2\pi)^{k/2} (\prod_{i=1}^k \sigma_i^2)^{1/2}} \exp\left[-\sum_{i=1}^k \sigma^2_i x_i^2/2\right]

\\&=\prod\frac{1}{(2\pi)^{1/2}\sigma_i} \exp\left[-\sigma_i^2 x^2_i/2\right]

\\&= \prod \mathcal N(0,\sigma^2_i).

\end{align}

**Conclusion 4: asymptotically ($n \to \infty$) PC variables of Gaussian data are statistically independent as random variables, and sample mutual information will give the population value zero.**

I should note that it is possible to understand this question differently (see comments by @whuber): to consider the whole matrix $\mathbf U$ a random variable (obtained from the random matrix $\mathbf X$ via a specific operation) and ask if any two specific elements $U_{ij}$ and $U_{kl}$ from two different columns are statistically independent across different draws of $\mathbf X$. We explored this question in this later thread.

**Here are all four interim conclusions from above:**

- In PCA coordinates, any data have zero correlation.
- In PCA coordinates, Gaussian data are “pretty much independent”, meaning that standard estimates of dependency will be around zero.
- Strictly speaking, for any finite $n$, Gaussian data in PCA coordinates are dependent; however, this dependency is practically irrelevant for any $n\gg 1$.
- Asymptotically ($n \to \infty$) PC variables of Gaussian data are statistically independent as random variables, and sample mutual information will give the population value zero.

**Attribution***Source : Link , Question Author : bill_e , Answer Author : Community*