In Bishop’s PRML book, he says that overfitting is a problem with Maximum Likelihood Estimation (MLE), and that the Bayesian approach can avoid it.
But I think overfitting is more a problem of model selection than of the method used for parameter estimation. That is, suppose I have a data set $D$ generated via $f(x)=\sin(x),\;x\in[0,1]$. I might choose different models $H_i$ to fit the data and find out which one is best. The models under consideration are polynomials of different orders: $H_1$ is order 1, $H_2$ is order 2, and $H_3$ is order 9.
Now I try to fit the data $D$ with each of the 3 models; each model $H_i$ has its own parameters, denoted $w_i$.
Using ML, I get a point estimate of the model parameters $w$. $H_1$ is too simple and will always underfit the data, whereas $H_3$ is too complex and will overfit it; only $H_2$ will fit the data well.
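The setup above can be sketched numerically. This is only a minimal illustration under assumptions that are not in the question: Gaussian observation noise with standard deviation 0.1, 10 training points, and a fixed random seed. Under Gaussian noise the ML estimate of $w$ coincides with the least-squares fit, which is what `Polynomial.fit` computes.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 0.1  # assumed noise level; the question does not specify one


def make_data(n):
    """Draw n noisy observations of sin(x) with x uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(x) + rng.normal(0.0, NOISE_STD, size=n)
    return x, y


x_train, y_train = make_data(10)    # small training set D
x_test, y_test = make_data(1000)    # held-out data to estimate generalisation


def fit_and_score(degree):
    """Least-squares (= ML under Gaussian noise) polynomial fit of the given order."""
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse


# H_1, H_2, H_3 from the question: polynomial orders 1, 2 and 9
results = {d: fit_and_score(d) for d in (1, 2, 9)}
for d, (train_mse, test_mse) in results.items():
    print(f"order {d}: train MSE = {train_mse:.3e}, test MSE = {test_mse:.3e}")
```

Because the models are nested, the training error cannot increase with the order; comparing it with the held-out error is what reveals whether a given order is over-fitting this particular sample.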
My questions are,
1) Model $H_3$ will overfit the data, but I don’t think that is a problem with ML; it is a problem with the model per se, because using ML for $H_1,H_2$ doesn’t result in overfitting. Am I right?
2) Compared to the Bayesian approach, ML does have some disadvantages, since it gives only a point estimate of the model parameters $w$ and is overconfident. The Bayesian approach, by contrast, relies not just on the most probable value of the parameters but on all possible values of the parameters given the observed data $D$, right?
3) Why can the Bayesian approach avoid or reduce overfitting? As I understand it, we can use it for model comparison: given data $D$, we could find the marginal likelihood (or model evidence) for each model under consideration and then pick the one with the highest marginal likelihood, right? If so, why is that?
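To make the model-comparison idea in 3) concrete, here is a hedged sketch of the log marginal likelihood for the three polynomial models. It assumes a conjugate setup that the question does not specify: a zero-mean Gaussian prior $w \sim \mathcal{N}(0, \alpha^{-1}I)$ and Gaussian noise with precision $\beta$, with `ALPHA` and `BETA` fixed by hand. Under those assumptions the data are marginally Gaussian, $y \sim \mathcal{N}(0, \beta^{-1}I + \alpha^{-1}\Phi\Phi^{T})$, where $\Phi$ is the polynomial design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.1              # assumed noise level (not given in the question)
ALPHA = 1.0                  # assumed prior precision on the weights w
BETA = 1.0 / NOISE_STD**2    # noise precision implied by NOISE_STD

# Small synthetic data set D from sin(x) on [0, 1]
x = rng.uniform(0.0, 1.0, size=10)
y = np.sin(x) + rng.normal(0.0, NOISE_STD, size=x.size)


def design_matrix(x, degree):
    """Columns are 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def log_evidence(x, y, degree):
    """log p(y | H) with w integrated out: y ~ N(0, I/BETA + Phi Phi^T / ALPHA)."""
    phi = design_matrix(x, degree)
    n = len(y)
    cov = np.eye(n) / BETA + phi @ phi.T / ALPHA
    _, logdet = np.linalg.slogdet(cov)      # cov is positive definite
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)


# H_1, H_2, H_3: polynomial orders 1, 2 and 9
log_ev = {d: log_evidence(x, y, d) for d in (1, 2, 9)}
for d, value in log_ev.items():
    print(f"order {d}: log evidence = {value:.3f}")
```

The values depend on the assumed `ALPHA` and `BETA`; no parameter is optimised inside `log_evidence`, but picking the model with the largest value is itself a choice made on a finite sample.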
Optimisation is the root of all evil in statistics. Any time you make choices about your model$^1$ by optimising some suitable criterion evaluated on a finite sample of data, you run the risk of over-fitting the criterion, i.e. reducing the statistic beyond the point where improvements in generalisation performance are obtained, so that the reduction is instead gained by exploiting the peculiarities of the sample of data (e.g. noise). The reason the Bayesian method works better is that you don’t optimise anything, but instead marginalise (integrate) over all possible choices. The problem then lies in the choice of prior beliefs regarding the model, so one problem has gone away, but another one appears in its place.
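In symbols, the contrast between optimising and marginalising can be sketched as follows. The notation is assumed for illustration: $t$ is a new target at input $x$, $w$ are the parameters of a model $H$, and $p(w \mid H)$ is the prior. The ML approach plugs a single optimised value into the predictive distribution,

$$w_{\mathrm{ML}} = \arg\max_{w}\, p(D \mid w, H), \qquad p(t \mid x, D, H) \approx p(t \mid x, w_{\mathrm{ML}}, H),$$

whereas the Bayesian approach averages the predictions of all parameter values, weighted by their posterior probability,

$$p(t \mid x, D, H) = \int p(t \mid x, w, H)\, p(w \mid D, H)\, \mathrm{d}w, \qquad p(w \mid D, H) = \frac{p(D \mid w, H)\, p(w \mid H)}{p(D \mid H)}.$$

The normalising constant $p(D \mid H) = \int p(D \mid w, H)\, p(w \mid H)\, \mathrm{d}w$ is the marginal likelihood (evidence) mentioned in the question.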
$^1$ This includes maximising the evidence (marginal likelihood) in a Bayesian setting. For an example of this, see the results for Gaussian Process classifiers in my paper, where optimising the marginal likelihood makes the model worse if you have too many hyper-parameters (note selection according to marginal likelihood will tend to favour models with lots of hyper-parameters as a result of this form of over-fitting).
G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.