# Why isn’t the holdout method (splitting data into training and testing) used in classical statistics?

In my classroom exposure to data mining, the holdout method was introduced as a way of assessing model performance. However, when I took my first class on linear models, this was not introduced as a means of model validation or assessment. My online research also doesn’t show any sort of intersection. Why is the holdout method not used in classical statistics?

A more productive question might be “why was it not used in the classical statistics I learned?”

That choice may come down to a combination of factors: the level at which the material was taught, the course content, and the time available. Important topics are often left aside because other material must be covered for one reason or another, in the hope that they will be treated in later subjects.

In some senses at least, the notion has long been used by a variety of people. It was more common in some areas than others. Many uses of statistics don’t have prediction or model selection as a major component (or in some cases, even at all), and in that case, the use of holdout samples may be less critical than when prediction is the main point. Arguably, it ought to have gained more widespread use at an earlier stage in some relevant applications than it did, but that’s not the same thing as being unknown.

If you look at areas that focus on prediction, the notion of assessing a model by predicting data you didn’t use to estimate it was certainly around (though not universal). I was doing it with time series modelling in the 1980s, for example, where out-of-sample predictive performance on the most recent data was particularly important. It wasn’t a novel idea even then; there were plenty of examples of that sort of notion around at the time.
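To make the out-of-sample idea concrete, here is a minimal sketch (using synthetic data and an assumed 80/20 split, both my own choices, not anything from the answer): fit a model on the earlier observations of a series and assess it on the most recent ones that the fit never saw.

```python
import numpy as np

# Synthetic series: a linear trend plus noise (illustrative data only).
rng = np.random.default_rng(0)
n = 100
t = np.arange(n)
y = 2.0 + 0.5 * t + rng.normal(scale=1.0, size=n)

# Hold out the most recent 20% of observations.
split = int(0.8 * n)
t_train, y_train = t[:split], y[:split]
t_test, y_test = t[split:], y[split:]

# Fit a straight line by least squares on the training portion only.
X_train = np.column_stack([np.ones(split), t_train])
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Out-of-sample predictive performance on the held-out recent data.
X_test = np.column_stack([np.ones(n - split), t_test])
mse_out = np.mean((y_test - X_test @ beta) ** 2)
print(round(float(mse_out), 3))
```

The key point is simply that the held-out observations play no role in estimating the coefficients, so the error on them is an honest measure of predictive performance.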

The notion of leaving out at least some data was used in regression (deleted residuals, PRESS, the jackknife, and so on) and in outlier analysis, for example.
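As a sketch of one of those regression tools, here is the PRESS statistic computed on synthetic data (the data and model are my own illustration, not from the answer). Each deleted residual is the prediction error for an observation from a fit that excludes it; the standard leverage identity e_i/(1 − h_ii) gives them without refitting n times.

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
e = y - H @ y                          # ordinary residuals
h = np.diag(H)                         # leverages h_ii

# Deleted (PRESS) residuals via the leverage shortcut, and the PRESS statistic.
press_resid = e / (1 - h)
PRESS = np.sum(press_resid ** 2)

# Sanity check: refit without observation 0 and predict it directly;
# the prediction error should match the shortcut's deleted residual.
mask = np.ones(n, dtype=bool)
mask[0] = False
beta0, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
assert np.isclose(y[0] - X[0] @ beta0, press_resid[0])
```

In that sense PRESS is a leave-one-out idea: every observation is, in turn, held out of the fit that predicts it.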

Some of these ideas date back a good deal earlier still. Stone (1974)[1] refers to papers on cross-validation (with the word in the title) from the 1950s and 60s. Perhaps even closer to your intent, he mentions Simon (1971)’s use of the terms “construction sample” and “validation sample” — but also points out that “Larson (1931) employed random division of the sample in an educational multiple-regression study”.

Topics like cross-validation and the use of statistics based on prediction were becoming substantially more frequent in the statistics literature through the 1970s and 80s, but many of the basic ideas had been around for quite some time even then.

[1]: Stone, M. (1974),
“Cross-Validatory Choice and Assessment of Statistical Predictions,”
Journal of the Royal Statistical Society, Series B (Methodological), Vol. 36, No. 2, pp. 111–147