I have always been confused about the use of the term “population” in statistics. In my first statistics course I was taught that we need a sample because surveying the whole population is too costly. So there is the whole population, and there is a small sample from it that we study.
The problem is that this intuition is just wrong outside of a few toy examples, where the population is literally the whole population of the US (or the world). Actually even in those few examples it is probably wrong, since world population is just one of hypothetical repeated random samples from DGP. So when we started estimating multivariate models in later statistics courses, I struggled to understand what the population now was and how it differed from the sample.
So I am really confused by the way statistics is taught. I feel like people use the term “population” partly for historical reasons and partly because it makes it easier to explain the concept of a sample in Stat 101. The problem is that it teaches the wrong intuition, which students have to unlearn later, and it leaves a hole in their understanding of the most fundamental statistical concepts. On the other hand, the concept of a DGP is harder to introduce in an elementary statistics course, but once students understand it, they have a solid conceptual foundation in statistics.
I have two questions:
I would guess that there is an ongoing discussion among statisticians on this issue, so can anybody give me references on this?
And more importantly, do you know of any introductory-level statistics textbooks that forgo “population” and introduce statistics based on the concepts of DGP and sample? Ideally, such a textbook would devote substantial space to explaining the conceptual foundations of statistics and statistical inference.
There are certainly already many contexts where statisticians do refer to a process rather than a population when discussing statistical analysis (e.g., when discussing a time-series process, stochastic process, etc.). Formally, a stochastic process is a set of random variables with a common domain, indexed over some set of values. This includes time-series, sequences of random variables, etc. The concept is general enough to encompass most situations where we have a set of random variables that are of interest in a statistical problem, and so statistics already has a sufficiently well-developed language to refer to hypothesised stochastic “processes”, and also refer to actual “populations” of things.
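For concreteness, that definition can be written in standard notation. The symbols below ($\Omega$, $\mathcal{F}$, $P$, $T$, $X_t$) are labels assumed for this sketch, and I take the variables to be real-valued purely for simplicity:

$$\{X_t : t \in T\}, \qquad X_t : \Omega \to \mathbb{R} \ \text{ for every } t \in T,$$

where $(\Omega, \mathcal{F}, P)$ is a single probability space shared by all of the random variables (the “common domain”) and $T$ is the index set. Taking $T = \{1, 2, 3, \dots\}$ gives a sequence of random variables, and reading the elements of $T$ as time points gives a time series.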
Whilst statisticians do refer to and model “processes”, these are abstractions that are formed by considering infinite sequences (or continuums) of random variables, and so they involve hypothesising quantities that are not all observable. The term “data-generating process” is itself problematic (and not as helpful as the existing terminology of a “stochastic process”), and I see no reason that its wide deployment would add greater understanding to statistics. Specifically, by referring to the generation of “data” this terminology pre-empts the question of what quantities are actually observed or observable. (Imagine a situation in which you want to refer to a “DGP” but then stipulate that some aspect of that process is not directly observable. Is it still appropriate to call the values in that process “data” if they are not observable?) In any case, setting aside the terminology, I see deeper problems in your approach, which go back to base issues in philosophy and the formulation of research questions.
Existents vs processes in empirical research: I see a number of premises in your view that strike me as problematic, and appear to me to misunderstand the goal of most empirical research that uses statistics. When we undertake empirical research, we often want to know about relationships between things that exist in reality, not hypothesised “processes” that exist only in our models (i.e., as mathematical abstractions from reality). Indeed, in sampling problems it is usually the case that we merely wish to estimate some aspect of the distribution of some quantity pertaining to a finite population. In this context, when we refer to a “population” of interest, we are merely designating a set of things that are of interest to us in a particular research problem. Consequently, if we are presently interested in all the people currently living in the USA, we would call this group the “population” (or the “population of interest”). However, if we are interested only in the people currently living in Maine, then we would call this smaller group the “population”. In each case, it does not matter whether the population can be considered as only part of a larger group — if it is the group of interest in the present problem then we will designate it as the “population”.
(I note that statistical texts often engage in a slight equivocation between the population of objects of interest, and the measurements of interest pertaining to those objects. For example, an analysis on the height of people might at various times refer to the set of people as “the population” but then refer to the corresponding set of height measurements as “the population”. This is a shorthand that allows statisticians to get directly to describing a set of numbers of interest.)
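As a rough illustration of a finite-population problem, here is a minimal Python sketch. Everything in it is hypothetical: the 10,000 people, their heights, the sample size of 200, and the helper name `draw_sample_mean` are all assumptions made for the example. The heights are made-up numbers that are generated once and then treated as fixed facts about the population. The sketch also shows both senses of “population” mentioned above: the set of people, and the corresponding set of height measurements.

```python
import random


def draw_sample_mean(measurements, n, rng):
    """Mean of a simple random sample of size n, drawn without replacement."""
    sample = rng.sample(measurements, n)
    return sum(sample) / n


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical population of interest: 10,000 people, identified by ID.
people = list(range(10_000))

# Made-up height in cm for each person. These values are synthetic; once
# created they are treated as fixed, existing attributes of the population.
height_cm = {person: rng.gauss(170, 10) for person in people}

# The corresponding "population" of measurements.
heights = [height_cm[person] for person in people]

# The quantity of interest is a fixed property of this finite population.
population_mean = sum(heights) / len(heights)

# In practice we observe only a sample and use it to estimate that quantity.
sample_mean = draw_sample_mean(heights, 200, rng)

print(f"population mean: {population_mean:.1f} cm")
print(f"sample estimate: {sample_mean:.1f} cm")
```

The target of inference here is `population_mean`, a property of things that actually exist in the (toy) population, and the only randomness involved in the estimate comes from which people happen to be sampled.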
Your philosophical approach here is at odds with this objective. You seem to be adopting a kind of Platonic view of the world, in which real-world entities are considered to be less real than some hypothesised “data-generating process” that (presumptively) generated the world. For example, in regard to the idea of referring to all the people on Earth as a “population”, you claim that “…it is probably wrong, since world population is just one of hypothetical repeated random samples from DGP”. This bears a substantial similarity to Plato’s theory of Forms, in which Plato regarded observation of the world as a mere imperfect glimpse of eternal Forms. In my view, a much better approach is the Aristotelian view that the things in reality exist, and we abstract from them to form our concepts. (This is a simplification of Aristotle, but you get the basic idea.)†
If you would like to get into literature on this issue, I think you will find that it goes deeper into the territory of philosophy (specifically metaphysics and epistemology), rather than the field of statistics. Essentially, your views here are about the broader issue of whether the things existing in reality are the proper objects of relevance to human knowledge, or whether (contrarily) they are merely an epiphenomenon of some broader hypothesised “process” that is the proper object of human inference. This is a philosophical question that has been a major part of the history of Western philosophy going back to Plato and Aristotle, so there is an enormous literature that could potentially shed light on this.
I hope that this answer will set you off on an interesting journey into the field of epistemology. For present purposes, you might wish to take a practical view that also considers the objectives that researchers set for themselves in their research. Ask yourself: would researchers generally prefer to know about properties of the people living on Earth, or would they prefer to try to find out about your (hypothesised) “hypothetical repeated random samples” of people who might have lived on Earth instead of us?
† To avoid any possible confusion among those lacking historical knowledge, please note that these are not real quotes from Plato and Aristotle — I have merely taken poetic license to liken their philosophical positions to the present issue.