I am wondering whether there is a negative correlation between mean squared error

MSE = (1/n)\sum_i (\hat{Y}_i - Y_i)^2

and the number of explanatory variables. Intuitively, I would guess that more independent variables should be able to explain more variation in my dependent variable. However, I was not able to find any literature about the topic.

Question 1: Is MSE decreasing with increasing number of explanatory variables? If yes, why?

Question 2: Is there any literature about the topic available?

**Answer**

I am assuming that you are talking about an ordinary least squares (OLS) regression scenario and are referring to in-sample MSE, that Y is an n-by-1 vector, and that X is an n-by-p matrix of orthogonal predictors (or variables, in your terminology). Remember that the columns of any full-rank matrix X can be orthogonalized; this will become important for making an intuitive leap later on. Let's also assume that the columns of X are centered (their means are zero) and have variance 1/n, so that each column has unit Euclidean norm.
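As an aside, the orthogonalization claim is easy to check numerically. A minimal sketch using NumPy's QR decomposition (the matrix sizes and random data here are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Any n-by-p design matrix with linearly independent columns...
X = rng.normal(size=(n, p))

# ...can be orthogonalized, e.g. via a (reduced) QR decomposition.
Q, R = np.linalg.qr(X)

# Q has orthonormal columns spanning the same column space as X,
# so Q^T Q is the p-by-p identity, matching the assumption above.
assert np.allclose(Q.T @ Q, np.eye(p))
```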

Granted the foregoing, the answer to (1) is *yes.* Here’s why.

MSE = (1/n)\|Y-\hat{Y}\|^2

=(1/n)\|Y-Xβ\|^2

=(1/n)\|Y-X(X^TX)^{-1}X^TY\|^2

Now, (X^TX)^{-1} is simply a p-by-p identity matrix (this follows from the orthogonality and unit norms we imposed earlier). It then follows that (X^TX)^{-1}X^T=X^T, and we have

MSE = (1/n)\|Y-XX^TY\|^2

So, what can we say about XX^T? We know it is an n-by-n matrix, and in the special case of p=n, it is an n-by-n *identity* matrix. That is, for p=n,

MSE = (1/n)\|Y-Y\|^2 = 0

which we know to be intuitively correct. Furthermore, we know that MSE is at its maximum when we lack any real predictors and X is simply a (normalized) column of ones; this is how we would fit an intercept-only model. In such a case, XX^T is (1/n) times an n-by-n matrix of ones, which maps Y to the constant vector of its mean. As p gets larger, the off-diagonal elements of XX^T shrink, eventually reaching zero when p=n.

This is not a rigorous proof and it does not, in fact, demonstrate that MSE is monotonically decreasing with p, but I think it provides a good intuitive foundation for understanding the behavior of least squares fitting.
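The intuition above can be checked with a small simulation. The sketch below (sample size, seed, and data are illustrative assumptions, not from the question) builds nested sets of orthonormal predictors and tracks in-sample MSE as columns are added:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
Y = rng.normal(size=n)

# Orthonormal columns obtained from the QR of a random n-by-n matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

mses = []
for p in range(1, n + 1):
    X = Q[:, :p]
    # With orthonormal columns, beta-hat = X^T Y and Y-hat = X X^T Y.
    Y_hat = X @ (X.T @ Y)
    mses.append(np.mean((Y - Y_hat) ** 2))

# In-sample MSE never increases as predictors are added...
assert all(a >= b - 1e-12 for a, b in zip(mses, mses[1:]))
# ...and reaches zero once p = n, as derived above.
assert np.isclose(mses[-1], 0.0)
```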

Edit: If you want to extend this analysis to estimating MSE out of sample, then you would consider the following:

\hat{MSE}=\hat{bias}^2+\hat{var}

\hat{bias}^2 is monotonically decreasing in p, while \hat{var} is monotonically increasing in p. For the relationships between p, n, and \hat{MSE}, I recommend Wessel van Wieringen's lecture notes on ridge regression, as well as The Elements of Statistical Learning, as mentioned in another answer to your original question. Hopefully that answers (2).
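To make the tradeoff concrete, here is a toy simulation (the data-generating process, sample sizes, and seed are my own assumptions) in which only the first five predictors truly matter. Test MSE first falls as the relevant predictors enter (bias shrinks), then rises again as noise predictors accumulate (variance grows):

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, p_max = 30, 1000, 25

# Hypothetical truth: only the first 5 of p_max predictors have nonzero coefficients.
beta_true = np.concatenate([np.ones(5), np.zeros(p_max - 5)])
X_tr = rng.normal(size=(n_train, p_max))
X_te = rng.normal(size=(n_test, p_max))
y_tr = X_tr @ beta_true + rng.normal(size=n_train)
y_te = X_te @ beta_true + rng.normal(size=n_test)

test_mse = []
for p in range(1, p_max + 1):
    # Fit OLS on the first p predictors, evaluate on held-out data.
    b, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    test_mse.append(np.mean((y_te - X_te[:, :p] @ b) ** 2))

# Including the truly relevant predictors reduces test MSE (bias falls)...
assert test_mse[4] < test_mse[0]
# ...but piling on irrelevant predictors eventually raises it (variance grows).
assert test_mse[-1] > test_mse[4]
```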

Edit: I thought about this some more, and there are two additional points I'd like to make. The first is the specific set of conditions under which an additional predictor will reduce in-sample MSE:

1) The additional predictor does not lie entirely within the column space of X; that is, it cannot be obtained via any linear combination of the existing predictors, and

2) The component of the new predictor lying outside the column space of X is not orthogonal to Y.

The second point is that we can do a simple thought experiment showing that adding new predictors does, in general, tend to decrease in-sample MSE. Imagine we have solved our linear regression and obtained β, a p-by-1 vector of model coefficients. Now imagine that we add one more predictor. Unless BOTH of the conditions above are satisfied, the (p+1)-th entry of the new β will be *zero*, and the model is exactly the same as before (same MSE). In general, though, both conditions *will* be satisfied, and the (p+1)-th entry will be something *other than* zero. Since both the zero-appended β and the refit β lie within the solution space of the least squares problem with p+1 predictors, the p+1 model *must* have MSE no greater than the p model, and strictly lower whenever the new coefficient is nonzero.
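This thought experiment can be replayed numerically. In the sketch below (design matrix, coefficients, and noise are all hypothetical), a redundant predictor that lies in the column space of X leaves in-sample MSE unchanged, while a genuinely new predictor correlated with Y lowers it:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def insample_mse(X, Y):
    # Least squares fit; with rank-deficient X, lstsq returns the
    # minimum-norm solution, which still yields the projection of Y.
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.mean((Y - X @ b) ** 2)

base = insample_mse(X, Y)

# Redundant predictor: a linear combination of existing columns,
# i.e. it lies entirely within the column space of X (violates condition 1).
redundant = X @ np.array([2.0, 1.0, -1.0])
mse_red = insample_mse(np.column_stack([X, redundant]), Y)

# A genuinely new predictor whose out-of-column-space component
# is correlated with Y (satisfies both conditions).
new = rng.normal(size=n) + Y
mse_new = insample_mse(np.column_stack([X, new]), Y)

assert np.isclose(mse_red, base)   # redundant column: fit unchanged
assert mse_new < base              # informative new column: MSE drops
```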

**Attribution**
*Source: Link, Question Author: Joachim Schork, Answer Author: Josh*