# Why can $$R^2$$ be negative in linear regression: interview question [duplicate]

I was asked an $$R^2$$ question during an interview, and I felt I was right then and still feel I’m right now. Essentially, the interviewer asked me whether it is possible for $$R^2$$ to be negative in linear regression.

I said that if you’re using OLS, then it is not possible, because the formal definition of $$R^2$$ is

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $$SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2$$ and $$SS_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$.

For $$R^2$$ to be negative, the second term must be greater than 1. This requires $$SS_{res} > SS_{tot}$$, which means the model fits worse than a horizontal line through the mean of the observed $$y$$.

I told the interviewer that it is not possible for $$R^2$$ to be negative, because if the horizontal line is indeed the line of best fit, then OLS will produce that line, unless we’re dealing with an ill-conditioned or singular system.
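This argument can be written as a single inequality. The following is a sketch, assuming a model with an intercept $$a$$ and one slope $$b$$ (symbols introduced here for illustration, not part of the interview discussion). OLS minimizes the residual sum of squares over all $$(a, b)$$, and the horizontal line $$a = \bar{y}$$, $$b = 0$$ is one of the candidates, so

$$SS_{res} = \min_{a,b} \sum_{i=1}^n (y_i - a - b x_i)^2 \le \sum_{i=1}^n (y_i - \bar{y} - 0 \cdot x_i)^2 = SS_{tot}$$

and therefore $$R^2 = 1 - SS_{res}/SS_{tot} \ge 0$$ whenever an intercept is included. The bound relies on the horizontal line being available to the fit.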

He claimed that this isn’t correct and that $$R^2$$ can still be negative, and that I could “see it easily in the case where there is no intercept.” (Note that all of the discussion so far was about the case WITH an intercept, which I confirmed at the beginning by asking whether there were any constraints on the best line passing through the origin; he said “no”.)

I can’t see this at all. I stood by my answer, and then mentioned that maybe with some other linear regression method you could get a negative $$R^2$$.

Is there any way for $$R^2$$ to be negative using OLS with or without intercept? Edit: I do understand that you can get a negative $$R^2$$ in the case without an intercept.

The interviewer is right. Sorry.

    set.seed(2020)
    x <- seq(0, 1, 0.001)
    err <- rnorm(length(x))
    y <- 99 - 30*x + err
    L <- lm(y ~ 0 + x)  # "0" forces the intercept to be zero
    plot(x, y, ylim = c(0, max(y)))
    abline(a = 0, b = coef(L), col = 'red')  # fitted line through the origin
    abline(h = mean(y), col = 'black')       # naive line: always guess mean(y)
    SSRes <- sum(resid(L)^2)
    SSTot <- sum((y - mean(y))^2)
    R2 <- 1 - SSRes/SSTot
    R2

I get $$R^2 = -31.22529$$. This makes sense when you look at the plot the code produces. The red line is the regression line. The black line is the “naive” line where you always guess the mean of $$y$$, regardless of $$x$$.

The $$R^2 < 0$$ makes sense when you consider what $$R^2$$ does. $$R^2$$ measures how much better the regression model is at guessing the conditional mean than always guessing the pooled mean. Looking at the graph, you're better off guessing the mean of the pooled values of $$y$$ than you are using the regression line.
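For a single regressor with no intercept, the condition for a negative $$R^2$$ can be worked out in closed form. This is a sketch; $$\hat{b}$$ is notation introduced here for the fitted slope, and all sums run over $$i = 1, \dots, n$$:

$$\hat{b} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad SS_{res} = \sum y_i^2 - \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2}, \qquad SS_{tot} = \sum y_i^2 - n\bar{y}^2$$

Comparing the two sums of squares gives

$$R^2 < 0 \iff \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2} < n\bar{y}^2$$

In words, $$R^2$$ goes negative when projecting $$y$$ onto $$x$$ captures less of $$y$$ than projecting it onto a constant does. That is the situation in the simulated data, where $$y$$ sits far from zero and a line through the origin cannot follow it.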

EDIT

There is an argument to be made that the "SSTot" to which you should compare an intercept-free model is just the sum of squares of $$y$$ (so $$\sum (y_i - 0)^2$$), not $$\sum (y_i - \bar{y})^2$$. However, $$R^2_{ish} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum y_i^2}$$ is quite different from the usual $$R^2$$ and (I think) loses the usual connection to amount of variance explained. If this $$R^2_{ish}$$ is used when the intercept is excluded, however, then $$R^2_{ish} \ge 0$$.
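The non-negativity follows from the same kind of minimization argument that applies to the usual $$R^2$$ with an intercept. As a sketch for one regressor (with $$b$$ denoting a candidate slope, a symbol introduced here): the intercept-free fit minimizes over $$b$$, and $$b = 0$$ is always a candidate, so

$$\sum (y_i - \hat{y}_i)^2 = \min_{b} \sum (y_i - b x_i)^2 \le \sum (y_i - 0 \cdot x_i)^2 = \sum y_i^2$$

which gives $$R^2_{ish} \ge 0$$. The baseline here is "always guess zero" rather than "always guess $$\bar{y}$$", which is why the two quantities can differ so much.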