Why can $R^2$ be negative in linear regression - interview question [duplicate]

I was asked an $R^2$ question during an interview, and I felt like I was right then, and still feel like I’m right now. Essentially, the interviewer asked me whether it is possible for $R^2$ to be negative in linear regression.

I said that if you’re using OLS, then it is not possible, because the formal definition of $R^2$ is

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $SS_{tot} = \sum_i^n (y_i - \bar{y})^2$ and $SS_{res} = \sum_i^n (y_i - \hat{y}_i)^2$.

In order for $R^2$ to be negative, the second term must be greater than 1. This would imply that $SS_{res} > SS_{tot}$, which would imply that the predictive model fits worse than a horizontal line through the mean of the observed $y$.

I told the interviewer that it is not possible for $R^2$ to be negative, because if the horizontal line is indeed the line of best fit, then OLS will produce that line unless we’re dealing with an ill-conditioned or singular system.
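The argument above can be written out as a short derivation. This is only a sketch, assuming OLS with an intercept and a single predictor $x$; the symbols $a$ (intercept) and $b$ (slope) are introduced here for illustration.

```latex
\begin{align*}
SS_{res} &= \min_{a,\,b} \sum_{i}^{n} (y_i - a - b x_i)^2 \\
         &\le \sum_{i}^{n} (y_i - \bar{y} - 0 \cdot x_i)^2
            && \text{($a = \bar{y}$, $b = 0$ is one candidate)} \\
         &= SS_{tot}
\end{align*}
\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \ge 0 \quad \text{whenever } SS_{tot} > 0.
\]
```

Without an intercept, $a$ is fixed at 0, so the horizontal line at $\bar{y}$ is no longer among the candidates and the inequality need not hold.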

He claimed that this isn’t correct and that $R^2$ can still be negative, and that I could “see it easily in the case where there is no intercept.” (Note that all of the discussion so far was about the case WITH an intercept, which I confirmed at the beginning by asking whether there were any constraints about the best line passing through the origin, to which he answered “no”.)

I can’t see this at all. I stood by my answer, and then mentioned that maybe if you used some other linear regression method, perhaps you can get a negative $R^2$.

Is there any way for $R^2$ to be negative using OLS with or without intercept? Edit: I do understand that you can get a negative $R^2$ in the case without an intercept.

Answer

The interviewer is right. Sorry.

set.seed(2020)
x <- seq(0, 1, 0.001)
err <- rnorm(length(x))
y <- 99 - 30*x + err
L <- lm(y~0+x) # "0" forces the intercept to be zero
plot(x, y, ylim=c(0, max(y)))
abline(a=0, b=coef(L)[1], col='red') # fitted line through the origin
abline(h=mean(y), col='black') # baseline: always predict mean(y)
SSRes <- sum(resid(L)^2)
SSTot <- sum((y - mean(y))^2)
R2 <- 1 - SSRes/SSTot
R2

I get $R^2 = -31.22529$. This makes sense when you look at the plot the code produces.

[Plot: scatter of y against x, with the red no-intercept regression line and the black horizontal line at the mean of y]

The red line is the regression line. The black line is the “naive” line where you always guess the mean of y, regardless of the x.

The $R^2 < 0$ makes sense when you consider what $R^2$ does. $R^2$ measures how much better the regression model is at guessing the conditional mean than always guessing the pooled mean. Looking at the graph, you're better off guessing the mean of the pooled values of $y$ than you are using the regression line.
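For contrast, here is a minimal sketch that refits the same simulated data with and without an intercept and scores both fits against the pooled-mean baseline. The helper name `r2_vs_mean` is made up for this illustration and is not part of the original answer.

```r
# Same simulated data as in the example above
set.seed(2020)
x <- seq(0, 1, 0.001)
y <- 99 - 30*x + rnorm(length(x))

# Hypothetical helper: R^2 measured against "always guess mean(y)"
r2_vs_mean <- function(fit, y) {
  1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
}

fit_int   <- lm(y ~ x)      # OLS with an intercept
fit_noint <- lm(y ~ 0 + x)  # OLS forced through the origin

r2_int   <- r2_vs_mean(fit_int, y)    # cannot fall below 0
r2_noint <- r2_vs_mean(fit_noint, y)  # negative for this data

c(with_intercept = r2_int, no_intercept = r2_noint)
```

With the intercept included, the horizontal line at `mean(y)` is one of the lines OLS can choose, so the fit can never do worse than that baseline.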

EDIT

There is an argument to be made that the "SSTot" to which you should compare an intercept-free model is just the sum of squares of $y$ (so $\sum (y_i - 0)^2$), not $\sum (y_i - \bar{y})^2$. However, $R^2_{ish} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum y_i^2}$ is quite different from the usual $R^2$ and (I think) loses the usual connection to amount of variance explained. If this $R^2_{ish}$ is used, however, when the intercept is excluded, $R^2_{ish} \ge 0$.
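A minimal sketch of this alternative on the same simulated data (the variable name `R2ish` simply follows the notation above and is not a standard function). As described in `?summary.lm`, R's own `summary()` measures against zero rather than the mean when the model has no intercept, so the two values below should agree up to floating-point error.

```r
set.seed(2020)
x <- seq(0, 1, 0.001)
y <- 99 - 30*x + rnorm(length(x))
L <- lm(y ~ 0 + x)  # no-intercept fit, as above

# Compare the residuals with the sum of squares of y (zero baseline)
R2ish <- 1 - sum(resid(L)^2) / sum(y^2)
R2ish  # non-negative for a no-intercept OLS fit

# R^2 that summary() reports for the no-intercept model
summary(L)$r.squared
```

This is also why the R-squared printed by `summary()` for a no-intercept model is not directly comparable to the one printed for a model with an intercept.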

Attribution
Source : Link , Question Author : 24n8 , Answer Author : Dave
