# Random sampling and independence in a real world problem

The book “Introduction to Econometrics” by Stock and Watson uses this example to illustrate the relation between random sampling and the independence of random variables. My problem is that, in this case, I don’t see why random sampling should imply independence.

In a simpler experiment I could have an urn containing $$N$$ balls, each with probability $$\frac{1}{N}$$ of being drawn. If I draw a ball, put it back in the urn, and draw again, the two draws are independent because the urn still has the same composition, and the two drawn balls together form a random sample. (Even if I don’t put the ball back, I can say the draws are approximately independent if $$N$$ is large enough.)

The example above, however, seems different to me. It is true that every element has the same probability of being selected at each draw, but I don’t see the link between this fact and the independence of the random variables. Why is this the case? In the simpler experiment, independence held because I put the ball back in the urn and its composition was the same as before. Here, after I select the first day at random and observe its commuting time, I know something new: that day now has a specific commuting time, not just a cumulative distribution function describing the probabilities of its possible commuting times. So when I put that day back into the “urn”, its commuting time is known, and the urn is not the same as before. Can someone please clarify? Is the difference unimportant? Why?

This extract from the text suffers from ambiguity and incorrectness.

Let’s deal with the latter first. Independence of two random variables $$X$$ and $$Y$$ is not about one variable “providing no information about the first” (a remarkably ambiguous phrase in its own right!). Independence is strictly about probabilities and it means nothing more nor less than the chance of any joint event (namely, that the value of $$X$$ lies in some set $$\mathcal A$$ and the value of $$Y$$ simultaneously lies in some other set $$\mathcal B$$) is determined from the separate chances alone (namely, by multiplying them).
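Written as a formula, using only the symbols already introduced, the product law says that for every pair of sets $$\mathcal A$$ and $$\mathcal B$$ for which these probabilities are defined,

$$\Pr(X\in\mathcal A,\ Y\in\mathcal B) = \Pr(X\in\mathcal A)\,\Pr(Y\in\mathcal B).$$

If this equality fails for even one pair of sets, $$X$$ and $$Y$$ are not independent.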

In this context it is natural to set up an urn model to understand the sampling. An extreme instance of this situation occurs with a truncated school year (as many have recently experienced!) in which the student commutes on just two days. The urn would contain two slips of paper representing the two commutes. On each slip is written the time of that commute. A random sample of size one is obtained by withdrawing a single slip blindly. Let $$X$$ be the value on that slip: it is a random variable. Let $$Y$$ be the collection of values on all remaining slips in the urn (namely, the commuting time on the day that was not selected). It is straightforward to show that the random variables $$X$$ and $$Y$$ are not independent: indeed, provided the two commuting times differ, the correlation between $$X$$ and $$Y$$ is $$-1,$$ and any variables with nonzero correlation are not independent.
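The failure of the product law in the two-slip urn can be checked by brute force. The following is only a minimal sketch: the commuting times `20` and `35` are hypothetical values chosen for illustration, and the helper `marginal` is not from any library.

```python
from fractions import Fraction

# Hypothetical commuting times (minutes) on the two slips; any two distinct values work.
slips = [20, 35]

# Drawing one slip blindly: each outcome (X, Y) = (drawn, left behind) has chance 1/2.
joint = {(slips[0], slips[1]): Fraction(1, 2),
         (slips[1], slips[0]): Fraction(1, 2)}

def marginal(joint, index):
    """Marginal distribution of the coordinate at `index` (0 for X, 1 for Y)."""
    dist = {}
    for outcome, p in joint.items():
        dist[outcome[index]] = dist.get(outcome[index], 0) + p
    return dist

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Product law: P(X = a, Y = b) must equal P(X = a) * P(Y = b) for every a and b.
violations = [(a, b) for a in p_x for b in p_y
              if joint.get((a, b), 0) != p_x[a] * p_y[b]]

print(violations)
```

Every product of marginal chances equals 1/4, whereas each joint chance is either 1/2 or 0, so every pair of values appears in `violations`.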

If you find samples of size $$1$$ conceptually objectionable, extend this example to a school year with three commuting days and consider a random sample (without replacement) of size $$2.$$ Such a sample is obtained by withdrawing two tickets, in order, without replacement. Let $$X_1$$ be the value written on the first ticket and $$X_2$$ the value on the second. Unless all three commuting times are equal, the correlation between $$X_1$$ and $$X_2$$ is $$-1/2,$$ again nonzero: these two commuting times are not independent. (The thread “Question on Covariance for sampling without replacement” explains how to calculate this covariance.)
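The value $$-1/2$$ can be confirmed by listing the six equally likely ordered samples. This is a sketch under stated assumptions: the three commuting times are hypothetical, and `correlation_without_replacement` is an illustrative helper, not a library function. It needs at least two distinct values in the urn, since otherwise the variance is zero and the correlation is undefined.

```python
from fractions import Fraction
from itertools import permutations

def correlation_without_replacement(values):
    """Exact correlation between the first and second draw when two tickets
    are drawn in order, without replacement, from an urn holding `values`."""
    pairs = list(permutations(values, 2))   # equally likely ordered samples
    p = Fraction(1, len(pairs))
    mean1 = sum(p * x1 for x1, _ in pairs)
    mean2 = sum(p * x2 for _, x2 in pairs)
    cov = sum(p * (x1 - mean1) * (x2 - mean2) for x1, x2 in pairs)
    var1 = sum(p * (x1 - mean1) ** 2 for x1, _ in pairs)
    # Both draws have the same marginal distribution, so Var(X1) = Var(X2)
    # and the correlation reduces to Cov(X1, X2) / Var(X1).
    return cov / var1

commute_minutes = [20, 35, 50]  # hypothetical times on the three tickets
print(correlation_without_replacement(commute_minutes))  # -1/2
```

The result does not depend on the particular times chosen, only on there being three tickets that are not all equal.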

It is possible the authors had in mind a model in which the urn is filled with gazillions of tickets reflecting some distribution of “hypothetical” commuting times. If so, the sample values will behave practically as if they were independent. But what would be the conceptual basis for constructing such a model?

The authors might also have (implicitly) been appealing to the idea that when there is a “large” number of tickets in the urn and “relatively few” are withdrawn for the sample, the values on the sampled tickets are approximately independent. But that sounds just too qualitative and slippery to serve as a decent explanation for any audience.
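One way to make “large” less slippery is to compute how the correlation between two draws without replacement changes with the number of tickets. The sketch below assumes hypothetical ticket values $$1, 2, \ldots, n$$; `draw_correlation` is an illustrative helper. Note that correlation measures only one aspect of dependence, so a small correlation does not by itself establish approximate independence.

```python
from fractions import Fraction

def draw_correlation(values):
    """Exact correlation between two ordered draws without replacement,
    computed from sums rather than by listing every pair."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    mean = Fraction(total, n)
    var = Fraction(total_sq, n) - mean ** 2
    # E[X1 * X2] averages v_i * v_j over the n * (n - 1) ordered pairs with i != j.
    cross = Fraction(total * total - total_sq, n * (n - 1))
    return (cross - mean ** 2) / var

for n in (3, 10, 100, 1000):
    urn = list(range(1, n + 1))  # hypothetical ticket values 1..n
    print(n, float(draw_correlation(urn)))
```

For these urns the correlation works out to $$-1/(n-1)$$: it shrinks toward zero as the urn grows but never reaches it.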

The more we think about this situation, the more reality intrudes. For instance, even when a school year comprises a full 180 (or so) days, why should we suppose the commuting times sampled during winter months “provide no information” about other nearby commuting times? In regions with serious winter weather nobody would believe this. “I see it took you two hours to get to school yesterday. Must be a lot of snow out there. I bet your ride during the next week is going to be extra long.”

We have already glossed over several ambiguities concerning what is meant by “no information” and what model is in use. There are other ambiguities. For the purposes of evaluating independence of values in the sample, should we — or should we not — suppose we might inspect the full contents of the urn? If one commuting time “provides no information” about any other commuting time in the sample, then how much less information must it provide about commuting times that weren’t sampled! How, then, could it be possible to make any inferences at all about the year’s commuting times based on the sampled values?

Although it might seem painful or excessively technical to do so, any demonstration that random variables are independent must appeal to the probabilistic definition of independence. That requires clearly indicating a probability model and showing that the probabilities in that model obey the product law that is characteristic of independence. Anything else is just hand-waving and threatens to confuse the thoughtful student.
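As an illustration of what such a demonstration involves, here is a minimal sketch for one explicitly stated model: two draws with replacement from an urn of commuting times. That model is an assumption made for this example, not necessarily the one the textbook intends. The commuting times and the helper `product_law_holds` are hypothetical. For a discrete model like this one, checking the product law for every pair of single values suffices, because the probabilities of larger sets are sums of these.

```python
from fractions import Fraction
from itertools import product

def product_law_holds(values):
    """Check P(X1 = a, X2 = b) == P(X1 = a) * P(X2 = b) for all a, b when
    two tickets are drawn WITH replacement from an urn holding `values`."""
    n = len(values)
    p = Fraction(1, n * n)  # every ordered pair of tickets is equally likely
    joint = {}
    for x1, x2 in product(values, repeat=2):
        joint[(x1, x2)] = joint.get((x1, x2), 0) + p
    p1, p2 = {}, {}
    for (x1, x2), q in joint.items():
        p1[x1] = p1.get(x1, 0) + q
        p2[x2] = p2.get(x2, 0) + q
    return all(joint.get((a, b), 0) == p1[a] * p2[b] for a in p1 for b in p2)

print(product_law_holds([20, 35, 35, 50]))  # True
```

Under this model the product law holds exactly, which is what justifies calling the two draws independent.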