# Is MSE decreasing with increasing number of explanatory variables?

I am wondering if there is a negative correlation between mean squared error (MSE) and the number of explanatory variables. Intuitively, I would guess that more independent variables should be able to explain more variation in my dependent variable. However, I was not able to find any literature on the topic.

Question 1: Is MSE decreasing with increasing number of explanatory variables? If yes, why?

Question 2: Is there any literature about the topic available?

I am assuming that you are talking about an ordinary least squares regression scenario and are referring to in-sample MSE, and that $Y$ is an n-by-1 vector and $X$ is an n-by-p matrix of orthogonal predictors (or variables, by your terminology). Remember that the columns of any matrix $X$ can be orthogonalized; this will become important for making an intuitive leap later on. Let’s also assume that the columns of $X$ have variance $1/n$ and are centered, such that their means are zero.

Granted the foregoing, the answer to (1) is yes. Here’s why.

MSE = $(1/n)\|Y-\hat{Y}\|^2$

$=(1/n)\|Y-Xβ\|^2$

$=(1/n)\|Y-X(X^TX)^{-1}X^TY\|^2$

Now, $(X^TX)^{-1}$ is simply a p-by-p identity matrix (this follows from the orthogonality and the variance scaling we imposed earlier, which together make the columns orthonormal). It then follows that $(X^TX)^{-1}X^T=X^T$, and we have

MSE = $(1/n)\|Y-XX^TY\|^2$

So, what can we say about $XX^T$? We know it is an n-by-n matrix, and in the special case of p=n, it is an n-by-n identity matrix. That is, for p=n,

MSE = $(1/n)\|Y-Y\|^2 = 0$

Which we know to be intuitively correct. Furthermore, we know that MSE is at its maximum when we lack any predictors and $X$ is simply a column of ones, normalized to unit length; this is how we would fit an intercept-only model. In such a case, $XX^T$ is an n-by-n matrix in which every entry equals $1/n$, so $XX^TY$ just replaces each observation with the mean of $Y$. As p gets larger, the off-diagonal elements of $XX^T$ shrink, eventually reaching zero when p=n.
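These two extreme cases are easy to check numerically. Here is a small sketch (names and setup are mine, not from the question) that builds the hat matrix $XX^T$ for a single unit-length column of ones and for a full set of p = n orthonormal columns:

```python
import numpy as np

n = 8

# Intercept-only case: one column of ones, normalized to unit length
u = np.ones((n, 1)) / np.sqrt(n)
H0 = u @ u.T                      # X X^T for the intercept-only model
print(np.allclose(H0, np.ones((n, n)) / n))   # every entry is 1/n -> True

# Saturated case: p = n orthonormal columns (QR of a random matrix)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Hn = Q @ Q.T                      # X X^T for p = n
print(np.allclose(Hn, np.eye(n)))             # the n-by-n identity -> True
```

Note that $H0 \, Y$ simply returns the mean of $Y$ in every entry, which is exactly the intercept-only fit, while $Hn \, Y = Y$ reproduces the data perfectly.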

This is not a rigorous proof and it does not, in fact, demonstrate that MSE is monotonically decreasing with p, but I think it provides a good intuitive foundation for understanding the behavior of least squares fitting.
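To make that intuition concrete, here is a quick numerical sketch under the same orthonormal-columns assumption (I skip the centering for brevity, and the variable names are hypothetical): in-sample MSE never increases as columns are added, and reaches zero at p = n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Orthonormalize a random matrix so X^T X = I for any leading block of columns
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
y = rng.standard_normal(n)

mses = []
for p in range(1, n + 1):
    X = Q[:, :p]
    y_hat = X @ (X.T @ y)         # (X^T X)^{-1} = I, so beta-hat = X^T y
    mses.append(np.mean((y - y_hat) ** 2))

# MSE is non-increasing in p, and hits zero at p = n
assert all(m2 <= m1 + 1e-12 for m1, m2 in zip(mses, mses[1:]))
assert np.isclose(mses[-1], 0.0)
```

With orthonormal columns the decrease is easy to see directly: each added column $q$ lowers the MSE by exactly $(q^TY)^2/n$, which is never negative.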

Edit: If you want to extend this analysis to estimating MSE out of sample, then you would consider the following:

$\widehat{\mathrm{MSE}}=\widehat{\mathrm{bias}}^2+\widehat{\mathrm{var}}$ (plus an irreducible noise term that does not depend on p)

$\widehat{\mathrm{bias}}^2$ is monotonically decreasing in p, and $\widehat{\mathrm{var}}$ is monotonically increasing in p. For the relationships between p, n, and out-of-sample MSE, I recommend Wessel van Wieringen’s lecture notes on ridge regression as well as The Elements of Statistical Learning, as mentioned in another answer to your original question. Hopefully that answers (2).
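As a rough illustration of the tradeoff (a simulation of my own construction, not taken from either reference), we can fit nested OLS models where only the first few predictors carry signal and the rest are noise. In-sample MSE falls with every added predictor, while held-out MSE eventually gets much worse:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p_true, p_max = 50, 200, 3, 45

# Only the first p_true coefficients are nonzero; the rest are noise predictors
beta = np.zeros(p_max)
beta[:p_true] = [2.0, -1.0, 0.5]

X_train = rng.standard_normal((n_train, p_max))
X_test = rng.standard_normal((n_test, p_max))
y_train = X_train @ beta + rng.standard_normal(n_train)
y_test = X_test @ beta + rng.standard_normal(n_test)

def mse(X, y, coef):
    return np.mean((y - X @ coef) ** 2)

train_mse, test_mse = [], []
for p in range(1, p_max + 1):
    coef, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    train_mse.append(mse(X_train[:, :p], y_train, coef))
    test_mse.append(mse(X_test[:, :p], y_test, coef))

# In-sample MSE is non-increasing in p; out-of-sample MSE blows up as p -> n
assert all(m2 <= m1 + 1e-10 for m1, m2 in zip(train_mse, train_mse[1:]))
assert test_mse[-1] > test_mse[p_true - 1]
```

The in-sample curve answers question (1); the held-out curve is the bias–variance tradeoff above in action.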

Edit: I thought about this some more, and there are two additional points I’d like to make. The first concerns the specific conditions under which an additional predictor will reduce in-sample MSE. Those conditions are:

1) The additional predictor does not lie entirely within the column space of $X$; that is, it cannot be obtained via any linear combination of the existing predictors, and

2) The component of the new predictor lying outside the column space of $X$ is not orthogonal to $Y$.

The second point is that a simple thought experiment shows the addition of new predictors does, in general, tend to decrease in-sample MSE. Imagine we have solved our linear regression and obtained $β$, a p-by-1 vector of model coefficients. Now imagine we add one more predictor. Unless BOTH of the aforementioned conditions are satisfied, we can set the (p+1)-th entry of $β$ to zero and the fit is exactly the same as before (same MSE). In general, though, both conditions will be satisfied, and the optimal (p+1)-th coefficient will be something other than zero. Since the zero-appended $β$ is always a feasible candidate for the least squares problem with p+1 predictors, and least squares selects the candidate with the lowest MSE, the p+1 model must have MSE no greater than the p model, and strictly lower whenever the new coefficient is nonzero.
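The two conditions can be checked numerically as well. In the sketch below (my own construction, using `numpy.linalg.lstsq`, which returns a minimum-norm solution when the augmented design is rank-deficient), a redundant column and a column whose out-of-column-space component is orthogonal to $Y$ both leave the in-sample MSE unchanged, while a generic new predictor strictly lowers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def insample_mse(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ coef) ** 2)

base = insample_mse(X, y)

# Violates condition 1: new column is a linear combination of existing columns
redundant = X @ np.array([1.0, -2.0, 0.5, 3.0])
mse1 = insample_mse(np.column_stack([X, redundant]), y)

# Violates condition 2: the component outside col(X) is orthogonal to y
z = rng.standard_normal(n)
P = X @ np.linalg.pinv(X)              # projection onto col(X)
z_perp = z - P @ z                     # component of z outside col(X)
r = y - P @ y                          # residual: part of y outside col(X)
z_perp -= (z_perp @ r) / (r @ r) * r   # strip the part aligned with y
mse2 = insample_mse(np.column_stack([X, X[:, 0] + z_perp]), y)

# Generic new predictor: both conditions hold, MSE strictly decreases
mse3 = insample_mse(np.column_stack([X, rng.standard_normal(n)]), y)

assert np.isclose(mse1, base)   # unchanged
assert np.isclose(mse2, base)   # unchanged
assert mse3 < base              # strictly lower
```

The second case exploits the fact that any vector orthogonal to col(X) is automatically orthogonal to $\hat{Y}$, so being orthogonal to the residual makes it orthogonal to $Y$ itself.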