A common way to reduce overfitting in a machine learning algorithm is to use a regularization term that penalizes large weights (L2) or encourages sparse weights (L1), and so on. How can such regularization reduce overfitting, especially in a classification algorithm? Can one show this mathematically?
This is related to the bias–variance tradeoff. Writing y = f∗ + ε for a noisy observation of the true estimator f∗, the expected error of our estimator f can be decomposed as

E[(y − f)²] = (E[f∗ − f])² + Var[f] + σ²,

where the bias, E[f∗ − f], is the systematic deviation of our estimator f from the true value, and the variance, Var[f], is essentially how sensitive our estimator is to fluctuations in the training set. The σ² term is the residual noise; it is irreducible and cannot be reduced by any modelling choice (if your samples are noisy, there might be something you can do about it when collecting the data, though).
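The decomposition can be checked empirically by refitting a model on many resampled training sets drawn from a known true function. This is a minimal sketch; the sine target, noise level, sample sizes, and polynomial degrees are all illustrative assumptions, not part of the original argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed "true estimator" f* for the simulation.
    return np.sin(2 * np.pi * x)

def fit_predict(degree, x_train, y_train, x_test):
    # Least-squares polynomial fit of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coeffs, x_test)

x_test = np.linspace(0, 1, 50)
sigma = 0.3                      # noise std; sigma^2 is the irreducible term
n_train, n_repeats = 20, 200

results = {}
for degree in (1, 3, 9):
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        preds[i] = fit_predict(degree, x, y, x_test)
    # Bias^2: squared gap between the average prediction and f*.
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    # Variance: spread of predictions across resampled training sets.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias2, variance)
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}")
```

In this setup the degree-1 model shows high bias and low variance, while the degree-9 model shows the reverse, which is exactly the tradeoff the decomposition describes.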
Overfitting happens when your model is too complex and fits the noise in the training data rather than the underlying pattern. When your model fits your training data perfectly, it is unlikely to fit new data well.
Underfitting happens when your model is not complicated enough. This introduces a bias into the model: a systematic deviation from the true underlying estimator.
Regularization attempts to reduce the variance of the estimator by simplifying it. This increases the bias, but in such a way that the expected error decreases. It is often used when the problem is ill-posed, e.g. when the number of parameters is greater than the number of samples.
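The ill-posed case can be illustrated with ℓ2-regularised least squares (ridge regression) when there are more parameters than samples. This is a hedged sketch with an arbitrary problem size and regularisation strength: plain least squares has no unique solution here because XᵀX is singular, while adding λI makes the system invertible.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_features = 10, 50        # p > n: the problem is ill-posed
X = rng.normal(size=(n_samples, n_features))
w_true = np.zeros(n_features)
w_true[:3] = [2.0, -1.0, 0.5]         # only a few informative weights (assumed)
y = X @ w_true + rng.normal(0, 0.1, n_samples)

# X^T X has rank at most n_samples (< n_features), so it is singular and
# the unregularised normal equations have no unique solution.
lam = 1.0                             # regularisation strength (illustrative)

# Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

print("largest |weight|:", np.abs(w_ridge).max())
```

The penalty selects one particular (small-norm) solution out of the infinitely many that fit the training data, which is precisely the simplification that trades a little bias for a large reduction in variance.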
Whether or not you obtain a successful reduction in expected error depends on your estimator and the regularisation used. For instance, for multiple linear regression with ℓ2 regularisation (ridge regression), it is possible to prove that there is a choice of regularisation parameter that reduces the expected error relative to the unregularised solution.
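That claim can be checked numerically for multiple linear regression by comparing out-of-sample error across a grid of regularisation parameters, averaged over many resampled training sets. This is a sketch under assumed problem dimensions, noise level, and λ grid; λ = 0 corresponds to plain least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, p = 30, 500, 20
w_true = rng.normal(size=p)
X_test = rng.normal(size=(n_test, p))
y_test = X_test @ w_true              # noise-free targets for the test set

def ridge_fit(X, y, lam):
    # Closed-form ridge solution; lam = 0 gives ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lams = [0.0, 0.1, 1.0, 10.0]          # illustrative grid; 0.0 = no regularisation
n_repeats = 100
errors = {lam: 0.0 for lam in lams}
for _ in range(n_repeats):
    X = rng.normal(size=(n_train, p))
    y = X @ w_true + rng.normal(0, 2.0, n_train)
    for lam in lams:
        w = ridge_fit(X, y, lam)
        # Accumulate average out-of-sample squared error.
        errors[lam] += np.mean((X_test @ w - y_test) ** 2) / n_repeats

print({lam: round(err, 2) for lam, err in errors.items()})
```

With only 30 samples for 20 parameters and fairly noisy targets, the unregularised fit has high variance, and some positive λ gives a lower average test error, matching the theoretical result.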