What are the theoretical guarantees of bagging

I’ve (approximately) heard that:

bagging is a technique to reduce the variance of a predictor/estimator/learning algorithm.

However, I have never seen a formal mathematical proof of this statement. Does anyone know why this is mathematically true? It seems to be such a widely accepted fact that I’d expect a direct reference for it, and I’d be surprised if there were none. Also, does anyone know what effect bagging has on the bias?

Are there any other theoretical guarantees for bagging that anyone knows of, considers important, and would like to share?


The main use case for bagging is reducing the variance of low-bias models by combining them. This was studied empirically in the landmark paper “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants” by Bauer and Kohavi. It usually works as advertised.
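
To make the "usually works" case concrete, here is a minimal simulation sketch. It is not taken from the Bauer and Kohavi paper. It assumes a toy regression problem (a sine curve with unit Gaussian noise), a 1-nearest-neighbour base learner as the low-bias, high-variance model, and a single query point x0 = 0.5. All function names and parameter values are made up for illustration.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x0):
    """Predict at x0 with the label of the nearest training point."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_one_nn_predict(x_train, y_train, x0, n_bags, rng):
    """Average 1-NN predictions over n_bags bootstrap resamples."""
    n = len(x_train)
    preds = np.empty(n_bags)
    for b in range(n_bags):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        preds[b] = one_nn_predict(x_train[idx], y_train[idx], x0)
    return preds.mean()

def compare_variance(n_trials=300, n=50, n_bags=50, x0=0.5, seed=0):
    """Estimate the variance of both predictors over fresh training sets."""
    rng = np.random.default_rng(seed)
    single = np.empty(n_trials)
    bagged = np.empty(n_trials)
    for t in range(n_trials):
        # Assumed toy problem: y = sin(2*pi*x) + N(0, 1) noise
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0, size=n)
        single[t] = one_nn_predict(x, y, x0)
        bagged[t] = bagged_one_nn_predict(x, y, x0, n_bags, rng)
    return single.var(), bagged.var()

var_single, var_bagged = compare_variance()
print(f"variance of single 1-NN prediction: {var_single:.3f}")
print(f"variance of bagged 1-NN prediction: {var_bagged:.3f}")
```

In this particular setup the bagged prediction should vary noticeably less across training sets than the single 1-NN prediction. That is an observation about one toy problem, not a proof.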

However, contrary to popular belief, bagging is not guaranteed to reduce the variance. A more recent and (in my opinion) better explanation is that bagging reduces the influence of leverage points. Leverage points are those that disproportionately affect the resulting model, such as outliers in least-squares regression. It is rare but possible for leverage points to positively influence resulting models, in which case bagging reduces performance. Have a look at “Bagging equalizes influence” by Grandvalet.
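
The leverage-point idea can be illustrated with a small sketch. It is not Grandvalet's analysis. It assumes hypothetical toy data: a line y = 2x with small noise, plus a single gross outlier placed far out in x. Each bootstrap sample leaves out any given point with probability (1 - 1/n)^n, roughly 0.37, so a sizeable share of the bagged fits never see the outlier at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + small noise, plus one gross outlier.
n_clean = 29
x = rng.uniform(0.0, 1.0, size=n_clean)
y = 2.0 * x + rng.normal(0.0, 0.1, size=n_clean)
x = np.append(x, 3.0)  # far from the other x values: high leverage
y = np.append(y, 0.0)  # the true line would give 6.0 here
outlier = len(x) - 1   # index of the outlier

def ols_slope(x, y):
    """Slope of the ordinary least-squares line through (x, y)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# Fit once on the full data, outlier included.
full_slope = ols_slope(x, y)

# Fit on bootstrap resamples and average the slopes (bagged linear fit).
n_bags = 500
n = len(x)
slopes = np.empty(n_bags)
contains_outlier = np.empty(n_bags, dtype=bool)
for b in range(n_bags):
    idx = rng.integers(0, n, size=n)  # draw n indices with replacement
    slopes[b] = ols_slope(x[idx], y[idx])
    contains_outlier[b] = np.any(idx == outlier)

bagged_slope = slopes.mean()
inclusion_rate = contains_outlier.mean()

print(f"true slope: 2.0, full-data slope: {full_slope:.3f}, "
      f"bagged slope: {bagged_slope:.3f}")
print(f"fraction of bags containing the outlier: {inclusion_rate:.3f}")
```

Here the outlier is harmful, so damping its influence moves the bagged slope towards the true one. If the leverage point were informative instead, the same damping would work against the model, which is the rare case mentioned above.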

So, to finally answer your question: the effect of bagging largely depends on leverage points. Few theoretical guarantees exist, except that bagging linearly increases computation time in terms of bag size! That said, it is still a widely used and very powerful technique. When learning with label noise, for instance, bagging can produce more robust classifiers.

Rao and Tibshirani have given a Bayesian interpretation in “The out-of-bootstrap method for model averaging and selection”:

In this sense, the bootstrap distribution represents an (approximate) nonparametric, non-informative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly- without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a “poor man’s” Bayes
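
As a rough illustration of that reading (not code from Rao and Tibshirani), the sketch below builds the bootstrap distribution of a sample mean and compares it with one familiar reference: the flat-prior posterior for the mean of a normal model with the sample variance plugged in as if known, which is N(x̄, s²/n). The exponential toy data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data; any sample would do.
data = rng.exponential(scale=2.0, size=100)
n = len(data)

# Bootstrap distribution of the mean: resample with replacement, recompute.
n_boot = 5000
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = data[idx].mean(axis=1)

boot_center = boot_means.mean()
boot_sd = boot_means.std(ddof=1)

# Reference (an assumption for comparison): flat-prior normal posterior
# for the mean with the variance treated as known, i.e. N(xbar, s^2 / n).
reference_center = data.mean()
reference_sd = data.std(ddof=1) / np.sqrt(n)

print(f"bootstrap: centre {boot_center:.3f}, sd {boot_sd:.3f}")
print(f"reference: centre {reference_center:.3f}, sd {reference_sd:.3f}")
```

The two sets of numbers should be close, and no prior or posterior sampler was needed to get the bootstrap ones.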

Question Author: Charlie Parker, Answer Author: Marc Claesen
