Why does a cross-validation procedure overcome the problem of overfitting a model?
I can’t think of a sufficiently clear explanation just at the moment, so I’ll leave that to someone else. However, cross-validation does not completely overcome the over-fitting problem in model selection; it just reduces it. The cross-validation error has a non-negligible variance, especially if the dataset is small; in other words, you get a slightly different value depending on the particular sample of data you use. This means that if you have many degrees of freedom in model selection (e.g. lots of features from which to select a small subset, many hyper-parameters to tune, many models from which to choose), you can over-fit the cross-validation criterion: the model ends up tuned in ways that exploit this random variation rather than in ways that genuinely improve performance, and you can end up with a model that performs poorly. For a discussion of this, see Cawley and Talbot, “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation”, JMLR, vol. 11, pp. 2079–2107, 2010.
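You can see this effect in a small simulation. The sketch below is purely illustrative (the dataset, the 200 candidate “models”, and the decision-stump rule are all my own invented setup, not anything from the paper): the labels are pure noise, so every candidate’s true accuracy is 50%, yet the candidate with the best cross-validation score looks much better than chance simply because we searched over many candidates on a noisy criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_models, k = 30, 200, 5

# Pure noise: labels are independent of every feature, so no candidate
# can truly do better than 50% accuracy.
y = rng.choice([-1, 1], size=n)
X = rng.standard_normal((n, n_models))

def cv_accuracy(x, y, k=5):
    """k-fold CV accuracy of a decision stump: predict s * sign(x),
    with the sign s fitted on the training folds."""
    idx = np.arange(len(y))
    correct = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # "Fit": choose the sign that works best on the training folds.
        s = 1 if (np.sign(x[train]) == y[train]).mean() >= 0.5 else -1
        correct += (s * np.sign(x[fold]) == y[fold]).sum()
    return correct / len(y)

scores = np.array([cv_accuracy(X[:, j], y) for j in range(n_models)])
print(f"mean CV accuracy over candidates: {scores.mean():.2f}")   # near 0.50
print(f"best CV accuracy over candidates: {scores.max():.2f}")    # well above 0.50
```

The average CV accuracy across candidates sits near 50%, as it should, but the maximum is substantially higher: selecting the winner by its CV score has over-fitted the selection criterion, and the winning score is an optimistically biased estimate of performance. This is why Cawley and Talbot recommend evaluating the selected model on data not used in the selection process (e.g. nested cross-validation).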
Sadly, cross-validation is most likely to let you down when you have a small dataset, which is exactly when you need it most. Note that k-fold cross-validation is generally more reliable than leave-one-out cross-validation, as it has a lower variance, but it may be more expensive to compute for some models (which is why LOOCV is sometimes used for model selection, despite its high variance).