How to calculate out of sample R squared?

I know this probably has been discussed somewhere else, but I have not been able to find an explicit answer. I am trying to use the formula $R^2 = 1 - SSR/SST$ to calculate the out-of-sample $R^2$ of a linear regression model, where $SSR$ is the sum of squared residuals and $SST$ is the total sum of squares. For the training set, it is clear that

$$ SST = \sum (y - \bar{y}_{train})^2 $$

What about the testing set? Should I keep using $\bar{y}_{train}$ for out of sample $y$, or use $\bar{y}_{test}$ instead?

I found that if I use $\bar{y}_{test}$, the resulting $R^2$ can be negative sometimes. This is consistent with the description of sklearn’s r2_score() function, where they used $\bar{y}_{test}$ (which is also used by their linear_model’s score() function for testing samples). They state that “a constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.”
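To make the negative values concrete, here is a minimal sketch that computes $1 - SSR/SST$ by hand with $\bar{y}_{test}$ as the baseline. It does not call sklearn; the function name `r2_test_baseline` and the toy numbers are made up for illustration.

```python
def r2_test_baseline(y_true, y_pred):
    """Return 1 - SSR/SST, with SST taken around the mean of y_true itself."""
    mean_true = sum(y_true) / len(y_true)
    ssr = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ssr / sst

# Hypothetical test targets and two hypothetical sets of predictions.
y_test = [1.0, 2.0, 3.0]
constant_pred = [2.0, 2.0, 2.0]  # always predicts the mean of y_test
bad_pred = [3.0, 2.0, 1.0]       # worse than the constant prediction

r2_constant = r2_test_baseline(y_test, constant_pred)  # SSR equals SST here
r2_bad = r2_test_baseline(y_test, bad_pred)            # SSR exceeds SST here
print(r2_constant, r2_bad)
```

Whenever the predictions have a larger squared error than the constant $\bar{y}_{test}$ would, $SSR > SST$ and the score drops below zero.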

However, in other places people have used $\bar{y}_{train}$ like here and here (the second answer by dmi3kno). So I was wondering which makes more sense? Any comment will be greatly appreciated!

Answer

You are correct.

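The OSR$^2$ residuals are computed on the testing data, but the baseline should still be the training data. With that said, your SST is $SST = \sum (y - \bar{y}_{train})^2$, where the sum runs over the test observations; notice that this is the same baseline mean as for the in-sample $R^2$.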
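As a rough sketch of the difference, the snippet below evaluates the same test predictions against both baselines. The helper `r2_with_baseline`, the toy data, and the predictions are all hypothetical; in practice `y_pred` would come from a model fitted on the training set only.

```python
def r2_with_baseline(y_true, y_pred, baseline_mean):
    """Return 1 - SSR/SST, where SST is measured around baseline_mean."""
    ssr = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - baseline_mean) ** 2 for y in y_true)
    return 1.0 - ssr / sst

# Hypothetical toy data: training targets, test targets, test predictions.
y_train = [1.0, 2.0, 3.0, 4.0]
y_test = [5.0, 6.0, 7.0]
y_pred = [5.5, 6.0, 6.5]

train_mean = sum(y_train) / len(y_train)
test_mean = sum(y_test) / len(y_test)

# Out-of-sample R^2: residuals from the test set, baseline from the training set.
osr2 = r2_with_baseline(y_test, y_pred, train_mean)

# Variant with the test mean as baseline (the definition quoted in the question).
r2_test = r2_with_baseline(y_test, y_pred, test_mean)

print(osr2, r2_test)
```

The two numbers differ only through SST. The training mean is the constant prediction that was actually available before seeing the test data, which is why it serves as the benchmark here; the test mean can never give a larger SST on the test set, so the second value is never higher than the first.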

Attribution
Source: Link, Question Author: crazydriver, Answer Author: Nick Cox
