# Why do we use PCA to speed up learning algorithms when we could just reduce the number of features?

In a machine learning course, I learned that one common use of PCA (Principal Component Analysis) is to speed up other machine learning algorithms. For example, imagine you are training a logistic regression model. If you have a training set $(x^{(i)},y^{(i)})$ for i from 1 to n and it turns out the dimension of your vector x is very large (let’s say a dimensions), you can use PCA to get a smaller dimension (let’s say k dimensions) feature vector z. Then you can train your logistic regression model on the training set $(z^{(i)},y^{(i)})$ for i from 1 to n. Training this model will be faster because your feature vector has less dimensions.

However, I don’t understand why you can’t just reduce the dimension of your feature vector to k dimensions by just choosing k of your features at random and eliminating the rest.

The z vectors are linear combinations of your a feature vectors. Since the z vectors are confined to a k-dimensional surface, you can write the a-k eliminated feature values as a linear function of the k remaining feature values, and thus all the z’s can be formed by linear combinations of your k features. So shouldn’t a model trained on an training set with eliminated features have the same power as a model trained on a training set whose dimension was reduced by PCA? Does it just depend on the type of model and whether it relies on some sort of linear combination?

Let’s say you initially have $p$ features but this is too many so you want to actually fit your model on $d < p$ features. You could choose $d$ of your features and drop the rest. If $X$ is our feature matrix, this corresponds to using $XD$ where $D \in \{0,1\}^{p \times d}$ picks out exactly the columns of $X$ that we want to include. But this ignores all information in the other columns, so why not consider a more general dimension reduction $XV$ where $V \in \mathbb R^{p \times d}$? This is exactly what PCA does: we find the matrix $V$ such that $XV$ contains as much of the information in $X$ as possible. Not all linear combinations are created equally. Unless our $X$ matrix is so low rank that a random set of $d$ columns can (with high probability) span the column space of all $p$ columns we will certainly not be able to do just as well as with all $p$ features. Some information will be lost, and so it behooves us to lose as little information as possible. With PCA, the "information" that we're trying to avoid losing is the variation in the data.
As for why we restrict ourselves to linear transformations of the predictors, the whole point in this use-case is computation time. If we could do fancy non-linear dimension reduction on $X$ we could probably just fit the model on all of $X$ too. So PCA sits perfectly at the intersection of fast-to-compute and effective.