I am interested in getting an unbiased estimate of $R^2$ in a multiple linear regression.
On reflection, I can think of two different values that an unbiased estimate of $R^2$ might be trying to match.
- Out-of-sample $R^2$: the $R^2$ that would be obtained if the regression equation estimated from the sample (i.e., $\hat{\beta}$) were applied to an infinite amount of data external to the sample but from the same data generating process.
- Population $R^2$: the $R^2$ that would be obtained if an infinite sample were available and the model were fitted to that infinite sample (i.e., $\beta$), or alternatively just the $R^2$ implied by the known data generating process.
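To make the distinction concrete, here is a minimal simulation sketch. The data generating process (independent standard-normal predictors, equal coefficients of 0.3, unit error variance) and the function name `simulate_two_targets` are assumptions chosen purely for illustration, and a large holdout sample stands in for the "infinite amount of data". Out-of-sample $R^2$ is computed here as one minus the ratio of holdout prediction error to holdout variance, which is only one possible reading of the definition above.

```python
import numpy as np

def simulate_two_targets(n=50, p=5, sigma=1.0, n_holdout=200_000, seed=0):
    """Return (sample R^2, population R^2, out-of-sample R^2) for one sample."""
    rng = np.random.default_rng(seed)
    beta = np.full(p, 0.3)  # hypothetical true coefficients

    # Population R^2 implied by the DGP: with independent standard-normal
    # predictors, Var(X beta) = beta'beta and Var(y) = beta'beta + sigma^2.
    signal_var = beta @ beta
    pop_r2 = signal_var / (signal_var + sigma**2)

    # Fit OLS (with intercept) on a sample of size n.
    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sample_r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

    # Apply the fitted equation (beta-hat) to a large holdout sample
    # drawn from the same DGP.
    Xh = rng.standard_normal((n_holdout, p))
    yh = Xh @ beta + sigma * rng.standard_normal(n_holdout)
    pred = coef[0] + Xh @ coef[1:]
    oos_r2 = 1 - np.sum((yh - pred) ** 2) / np.sum((yh - yh.mean()) ** 2)

    return sample_r2, pop_r2, oos_r2

sample_r2, pop_r2, oos_r2 = simulate_two_targets()
print(f"sample: {sample_r2:.3f}  population: {pop_r2:.3f}  out-of-sample: {oos_r2:.3f}")
```

In this setup the out-of-sample value falls below the population value, because the estimated coefficients can only predict new data worse than the true ones do. The two targets therefore call for different corrections.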
I understand that adjusted $R^2$ is designed to compensate for the overfitting observed in sample $R^2$. Nonetheless, it’s not clear whether adjusted $R^2$ is actually an unbiased estimate of $R^2$, and if it is an unbiased estimate, which of the above two definitions of $R^2$ it is aiming to estimate.
Thus, my questions:
- What is an unbiased estimate of what I call out-of-sample $R^2$ above?
- What is an unbiased estimate of what I call population $R^2$ above?
- Are there any references that provide simulation or other proof of the unbiasedness?
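For what it is worth, the kind of simulation check asked about in the last question can be sketched in a few lines. This is only an illustration under an assumed data generating process (independent standard-normal predictors, equal coefficients of 0.3, unit error variance). The function name `r2_bias_simulation` is hypothetical, and the adjustment used is the common $1 - \frac{n-1}{n-p-1}(1 - R^2)$ form of adjusted $R^2$. It compares the average of each estimate across replications with the known population $R^2$.

```python
import numpy as np

def r2_bias_simulation(n=30, p=5, sigma=1.0, reps=2000, seed=1):
    """Monte Carlo means of sample and adjusted R^2 versus the population R^2."""
    rng = np.random.default_rng(seed)
    beta = np.full(p, 0.3)  # hypothetical true coefficients
    pop_r2 = beta @ beta / (beta @ beta + sigma**2)

    raw = np.empty(reps)
    adj = np.empty(reps)
    for i in range(reps):
        # Draw a fresh sample (random regressors) and fit OLS with intercept.
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        A = np.column_stack([np.ones(n), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        raw[i] = r2
        adj[i] = 1 - (n - 1) / (n - p - 1) * (1 - r2)  # usual adjusted R^2

    return pop_r2, raw.mean(), adj.mean()

pop_r2, mean_raw, mean_adj = r2_bias_simulation()
print(f"population: {pop_r2:.3f}  mean R^2: {mean_raw:.3f}  mean adjusted R^2: {mean_adj:.3f}")
```

A single design cell like this cannot establish unbiasedness in general. It only shows how the bias of a candidate estimator could be examined for a given $n$, $p$ and population $R^2$.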
Evaluation of analytic adjustments to R-square
@ttnphns referred me to the Yin and Fan (2001) article, which compares different analytic methods of estimating $R^2$. As in my question, they distinguish between two types of estimators.
They use the following terminology:
- $\rho^2$: Estimator of the squared population multiple correlation coefficient
- $\rho_c^2$: Estimator of the squared population cross-validity coefficient
Their results are summarised in the abstract:
The authors conducted a Monte Carlo experiment to investigate the
effectiveness of the analytical formulas for estimating $R^2$ shrinkage,
with 4 fully crossed factors (squared population multiple correlation
coefficient, number of predictors, sample size, and degree of
multicollinearity) and 500 replications in each cell. The results
indicated that the most widely used Wherry formula (in both SAS and
SPSS) is probably not the most effective analytical formula for
estimating $\rho^2$. Instead, the Pratt formula and the Browne formula
outperformed other analytical formulas in estimating $\rho^2$ and $\rho_c^2$,
respectively.
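Thus, the article implies that the Pratt formula (p. 209) is a good choice for estimating $\rho^2$:
$$\hat{\rho}^2_P = 1 - \frac{(N-3)(1-R^2)}{N-p-1}\left[1 + \frac{2(1-R^2)}{N-p-2.3}\right]$$
where $N$ is the sample size, $p$ is the number of predictors, and $R^2$ is the sample squared multiple correlation.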
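As a small sketch, the Pratt adjustment can be put next to the usual adjusted $R^2$ in code. The Pratt expression used here, $1 - \frac{(N-3)(1-R^2)}{N-p-1}\left[1 + \frac{2(1-R^2)}{N-p-2.3}\right]$, is the form in which the formula is usually transcribed from Yin and Fan (2001) and should be checked against the article. The function names are hypothetical, and the "Wherry" label follows the abstract's description of the formula used in SAS and SPSS.

```python
def wherry_adjusted_r2(r2, n, p):
    """Usual adjusted R^2: 1 - (N - 1) / (N - p - 1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

def pratt_adjusted_r2(r2, n, p):
    """Pratt formula as transcribed from Yin and Fan (2001); verify before use."""
    shrink = (n - 3) * (1 - r2) / (n - p - 1)
    correction = 1 + 2 * (1 - r2) / (n - p - 2.3)
    return 1 - shrink * correction

# Example: sample R^2 of 0.5 with N = 50 observations and p = 5 predictors.
r2, n, p = 0.5, 50, 5
print(f"Wherry: {wherry_adjusted_r2(r2, n, p):.4f}")
print(f"Pratt:  {pratt_adjusted_r2(r2, n, p):.4f}")
```

Both functions return 1 when the sample $R^2$ is exactly 1, and both shrink the sample value more strongly as $p$ approaches $N$.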
Empirical estimates of adjustments to R-square
Kromrey and Hines (1995) review empirical estimates of $R^2$ (e.g., cross-validation approaches). They show that such algorithms are inappropriate for estimating $\rho^2$. This makes sense given that such algorithms seem to be designed to estimate $\rho_c^2$. However, after reading this, I still wasn’t sure whether some appropriately corrected empirical estimate might perform better than the analytic estimates in estimating $\rho^2$.
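For illustration, one generic empirical estimate of this kind is a k-fold cross-validated $R^2$. The sketch below is not the specific procedure reviewed by Kromrey and Hines (1995). The function name `kfold_cv_r2`, the choice of five folds, and the example data are all assumptions. It accumulates squared prediction errors on held-out folds and compares them with the total sum of squares.

```python
import numpy as np

def kfold_cv_r2(X, y, k=5, seed=0):
    """Cross-validated R^2: 1 - (held-out squared error) / (total sum of squares)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)

    press = 0.0  # accumulated squared prediction error on held-out cases
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit OLS with intercept on the training folds only.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        pred = coef[0] + X[test_idx] @ coef[1:]
        press += np.sum((y[test_idx] - pred) ** 2)

    return 1 - press / np.sum((y - y.mean()) ** 2)

# Example with simulated data (hypothetical coefficients).
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 4))
y = X @ np.array([0.5, 0.3, 0.0, -0.2]) + rng.standard_normal(60)
print(f"5-fold cross-validated R^2: {kfold_cv_r2(X, y):.3f}")
```

Because every prediction comes from coefficients estimated without that case, the quantity tracks how well a fitted equation generalises, which is closer in spirit to $\rho_c^2$ than to $\rho^2$. It can also be negative in small samples.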
- Kromrey, J. D., & Hines, C. V. (1995). Use of empirical estimates of shrinkage in multiple regression: A caution. Educational and Psychological Measurement, 55(6), 901-925.
- Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education, 69(2), 203-224.