I have been studying statistics from many books for the last 3 years, and thanks to this site I have learned a lot. Nevertheless, one fundamental question still remains unanswered for me. It may have a very simple or a very difficult answer, but I know for sure it requires some deep understanding of statistics.
When fitting a model to data, be it with a frequentist or a Bayesian approach, we propose a model, which may consist of a functional form for the likelihood, a prior, a kernel (non-parametric), etc. The issue is that any model fits a sample with some level of goodness. One can always find a better or worse model compared to what’s currently at hand. At some point we stop and start drawing conclusions, generalize to population parameters, report confidence intervals, calculate risk, etc. Hence, whatever conclusion we draw is always conditional on the model we decided to settle with. Even if we use tools that estimate the expected KL divergence, such as AIC, MDL, etc., they don’t say anything about where we stand on an absolute basis; they only improve our estimate on a relative basis. It seems there is no objectivity, as the model error is completely ignored.
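To spell out what “relative” means here, a short sketch in standard notation, where $f$ denotes the unknown true density and $g_i$ a fitted candidate (both symbols are introduced only for this illustration):

$$
D_{\mathrm{KL}}(f \,\|\, g_i) \;=\; \mathbb{E}_f\!\left[\log f(X)\right] \;-\; \mathbb{E}_f\!\left[\log g_i(X)\right]
$$

The first term depends only on $f$ and is the same for every candidate, so it cancels in any comparison between $g_i$ and $g_j$. Criteria of this kind target only the second term. The first term, which would be needed to say how far the best candidate is from $f$, is never estimated.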
Now suppose that we would like to define a step-by-step procedure to apply to any data set when building models. What should we specify as a stopping rule? Can we at least bound the model error, which would give us an objective stopping point? (This is different from stopping training using a validation sample, since that only gives a stopping point within the evaluated model class rather than with respect to the true data-generating process, DGP.)
Unfortunately, this question does not have a good answer. You can choose the best model on the grounds that it minimizes absolute error or squared error, maximizes likelihood, or optimizes some criterion that penalizes likelihood (e.g. AIC, BIC), to mention just a few of the most common choices. The problem is that none of those criteria will let you choose the objectively best model, only the best among those you compared. Another problem is that while optimizing you can always end up in a local maximum or minimum. Yet another problem is that your choice of criterion for model selection is subjective. In many cases you consciously, or semi-consciously, decide what you are interested in and choose the criterion based on that. For example, using BIC rather than AIC leads to more parsimonious models, with fewer parameters. Usually, for modeling you are interested in more parsimonious models that lead to some general conclusions about the universe, while for prediction this need not be so, and sometimes a more complicated model has better predictive power (though it does not have to, and often it does not). In yet other cases, more complicated models are preferred for practical reasons; for example, when estimating a Bayesian model with MCMC, a model with hierarchical hyperpriors can behave better in simulation than a simpler one. On the other hand, we are generally afraid of overfitting, and a simpler model has a lower risk of overfitting, so it is the safer choice. A good example of this is automatic stepwise model selection, which is generally not recommended because it easily leads to overfitted models and biased estimates. There is also a philosophical argument, Occam’s razor, that the simplest model is the preferred one. Notice also that we are discussing comparing different models here, while in real-life situations different statistical tools can also lead to different results, so there is an additional layer of choosing the method!
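As a rough illustration of the “best among those you compared” point, here is a minimal Python sketch. Everything in it is an assumption made for the example: the sample is simulated from a gamma distribution, and the three candidate families (exponential, normal, log-normal) deliberately do not contain the true data-generating process. The helper names (`fit_exponential`, `information_criteria`, and so on) are hypothetical, not taken from any library.

```python
import math
import random


def fit_exponential(xs):
    """Closed-form MLE for Exponential(rate); returns (log-likelihood, n_params)."""
    n = len(xs)
    rate = n / sum(xs)
    return n * math.log(rate) - rate * sum(xs), 1


def fit_normal(xs):
    """Closed-form MLE for Normal(mean, var); returns (log-likelihood, n_params)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # MLE variance divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1), 2


def fit_lognormal(xs):
    """MLE for LogNormal: fit a normal to log(x), then change variables."""
    logs = [math.log(x) for x in xs]
    loglik_of_logs, k = fit_normal(logs)
    # The density of x equals the density of log(x) divided by x.
    return loglik_of_logs - sum(logs), k


def information_criteria(xs, candidates):
    """Compute AIC and BIC for every candidate fitted to the same sample."""
    n = len(xs)
    table = {}
    for name, fit in candidates.items():
        loglik, k = fit(xs)
        table[name] = {
            "k": k,
            "aic": 2 * k - 2 * loglik,
            "bic": k * math.log(n) - 2 * loglik,
        }
    return table


random.seed(0)
# Assumed "true" process: Gamma(shape=2, scale=1), which is in none of the families below.
sample = [random.gammavariate(2.0, 1.0) for _ in range(500)]

candidates = {
    "exponential": fit_exponential,
    "normal": fit_normal,
    "lognormal": fit_lognormal,
}
table = information_criteria(sample, candidates)
best_aic = min(table, key=lambda name: table[name]["aic"])
best_bic = min(table, key=lambda name: table[name]["bic"])

for name, row in table.items():
    print(f"{name:12s} k={row['k']}  AIC={row['aic']:.1f}  BIC={row['bic']:.1f}")
print("chosen by AIC:", best_aic, "| chosen by BIC:", best_bic)
```

Whichever candidate wins, the criteria only rank the three families against each other; nothing in the table says how far the winner is from the gamma process that actually produced the data. Adding or removing a candidate can change the winner without changing the data.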
All this leads to the sad, but entertaining, fact that we can never be sure. We start with uncertainty, use methods to deal with it, and end up with uncertainty. This may seem paradoxical, but recall that we use statistics because we believe that the world is uncertain and probabilistic (otherwise we would choose a career as prophets), so how could we possibly end up with a different conclusion? There is no objective stopping rule, and there are multiple possible models, all of them wrong (sorry for the cliché!) because they try to simplify a complicated (constantly changing and probabilistic) reality. We find some of them more useful than others for our purposes, and sometimes we find different models useful for different purposes. You can go to the very bottom and notice that in many cases we build models of unknown $\theta$’s that in most cases can never be known, or do not even exist (does a population have any $\mu$ for age?). Most models do not even try to describe reality but rather provide abstractions and generalizations, so they cannot be “right” or “correct”.
You can go even deeper and find that there is no such thing as “probability” in reality: it is just an approximation of the uncertainty around us, and there are alternative ways of approximating it, such as fuzzy logic (see Kosko, 1993 for discussion). Even the very basic tools and theorems that our methods are grounded in are approximations, and they are not the only possible ones. We simply cannot be certain in such a setup.
The stopping rule that you are looking for is always problem-specific and subjective, i.e. based on so-called professional judgment. By the way, many studies have shown that professionals are often no better, and sometimes even worse, in their judgment than laypeople (e.g. reviewed in papers and books by Daniel Kahneman), while being more prone to overconfidence (this is actually an argument for why we should not try to be “sure” about our models).
Kosko, B. (1993). Fuzzy thinking: the new science of fuzzy logic. New York: Hyperion.