I want to build a regression model that averages multiple OLS models, each fit on a subset of the full data. The idea is based on this paper. I create k folds and fit k OLS models, each on the data with one of the folds left out. I then average the regression coefficients to get the final model.

This strikes me as similar to something like random forest regression, in which multiple regression trees are built and averaged. However, performance of the averaged OLS model seems worse than simply building one OLS model on the entire data. My question is: is there a theoretical reason why averaging multiple OLS models is wrong or undesirable? Can we expect averaging multiple OLS models to reduce overfitting? Below is an R example.

```r
# Load and prepare data
library(MASS)
data(Boston)
trn <- Boston[1:400, ]
tst <- Boston[401:nrow(Boston), ]

# Create function to build k averaging OLS model
lmave <- function(formula, data, k, ...){
  lmall <- lm(formula, data, ...)
  folds <- cut(seq(1, nrow(data)), breaks=k, labels=FALSE)
  for(i in 1:k){
    tstIdx <- which(folds==i, arr.ind = TRUE)
    tst <- data[tstIdx, ]
    trn <- data[-tstIdx, ]
    assign(paste0('lm', i), lm(formula, data = trn, ...))
  }

  coefs <- data.frame(lm1=numeric(length(lm1$coefficients)))
  for(i in 1:k){
    coefs[, paste0('lm', i)] <- get(paste0('lm', i))$coefficients
  }
  lmnames <- names(lmall$coefficients)
  lmall$coefficients <- rowMeans(coefs)
  names(lmall$coefficients) <- lmnames
  lmall$fitted.values <- predict(lmall, data)
  target <- trimws(gsub('~.*$', '', formula))
  lmall$residuals <- data[, target] - lmall$fitted.values

  return(lmall)
}

# Build OLS model on all trn data
olsfit <- lm(medv ~ ., data=trn)

# Build model averaging five OLS
olsavefit <- lmave('medv ~ .', data=trn, k=5)

# Build random forest model
library(randomForest)
set.seed(10)
rffit <- randomForest(medv ~ ., data=trn)

# Get RMSE of predicted fits on tst
library(Metrics)
rmse(tst$medv, predict(olsfit, tst))
# [1] 6.155792
rmse(tst$medv, predict(olsavefit, tst))
# [1] 7.661   ## Performs worse than olsfit and rffit
rmse(tst$medv, predict(rffit, tst))
# [1] 4.259403
```

**Answer**

By the Gauss-Markov theorem, OLS has the smallest variance, and hence the smallest MSE, amongst all linear unbiased estimators (under the theorem's assumptions). A weighted average of linear unbiased estimators (e.g., the estimated linear functions from each of your $k$ folds) is itself a linear unbiased estimator. It follows that OLS applied to the entire data set will, in expectation, outperform the weighted average of the $k$ linear regressions unless, by chance, the two give identical results.
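
To make the Gauss-Markov argument concrete, here is a minimal R sketch that compares the exact coefficient variances of the two estimators for a fixed design. It assumes homoskedastic, uncorrelated errors, so variances are reported in units of $\sigma^2$. The simulated design `X` and the values of `n`, `k`, and `p` are illustrative assumptions, not the Boston data from the question.

```r
# Exact variance comparison for a fixed design (simulated X, purely illustrative)
set.seed(1)
n <- 100; k <- 5; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))     # hypothetical design matrix
folds <- cut(seq_len(n), breaks = k, labels = FALSE)

# Full-data OLS is linear in y: beta_hat = H_full %*% y, H_full = (X'X)^{-1} X'
H_full <- solve(crossprod(X), t(X))

# Fold-averaged OLS is also linear in y: beta_bar = A %*% y, where A averages
# the k leave-one-fold-out OLS maps (held-out rows get zero weight in that fit)
A <- matrix(0, ncol(X), n)
for (i in seq_len(k)) {
  keep <- folds != i
  Xi <- X[keep, , drop = FALSE]
  A[, keep] <- A[, keep] + solve(crossprod(Xi), t(Xi)) / k
}

# Unbiasedness check: both maps satisfy (map %*% X) = identity
unbiased_full <- max(abs(H_full %*% X - diag(ncol(X))))
unbiased_ave  <- max(abs(A %*% X - diag(ncol(X))))

# Coefficient variances in units of sigma^2: diag(H H') and diag(A A')
var_full <- diag(tcrossprod(H_full))
var_ave  <- diag(tcrossprod(A))
print(rbind(var_full, var_ave))
```

Under these assumptions, every entry of `var_ave` should be at least as large as the matching entry of `var_full`, which is what the theorem predicts. The gap may be small when the folds overlap heavily, as they do here.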

As to overfitting: linear models are not prone to overfitting in the same way that, for example, gradient boosting machines are. The enforcement of linearity sees to that. If you have a very small number of outliers that pull your OLS regression line well away from where it should be, your approach may slightly (only slightly) ameliorate the damage. There are far superior approaches to that problem, however, e.g., robust linear regression, or simply plotting the data, identifying the outliers, and removing them (assuming that they are indeed not representative of the data-generating process whose parameters you are interested in estimating).
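
As a rough illustration of the robust-regression option, the sketch below fits `lm()` and `MASS::rlm()` to simulated data with a handful of gross outliers. The data-generating line (intercept 2, slope 3), the number of outliers, and their size are all assumptions chosen for the example.

```r
library(MASS)  # provides rlm()

set.seed(42)
n <- 100
x <- seq(0, 10, length.out = n)
y <- 2 + 3 * x + rnorm(n)        # true slope is 3 (assumed for illustration)

# Contaminate the last five observations with gross outliers
out <- (n - 4):n
y[out] <- y[out] - 60

ols_fit <- lm(y ~ x)
rob_fit <- rlm(y ~ x)            # Huber M-estimator with rlm()'s defaults

slope_ols <- unname(coef(ols_fit)["x"])
slope_rob <- unname(coef(rob_fit)["x"])
print(c(ols = slope_ols, robust = slope_rob))
```

In a setup like this, the robust slope should sit much closer to the true value than the OLS slope, because `rlm()` downweights observations with large residuals rather than giving them full influence.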

**Attribution**

*Source: Link, Question Author: Gaurav Bansal, Answer Author: jbowman*