# What is the frequentist take on the voltmeter story?

What is the frequentist take on the voltmeter story and its variations? The idea behind it is that a statistical analysis that appeals to hypothetical events would have to be revised if it was later learned that those hypothetical events could not have taken place as assumed.

The version of the story on Wikipedia is provided below.

> An engineer draws a random sample of electron tubes and measures their
> voltage. The measurements range from 75 to 99 volts. A statistician
> computes the sample mean and a confidence interval for the true mean.
> Later the statistician discovers that the voltmeter reads only as far
> as 100, so the population appears to be ‘censored’. This necessitates
> a new analysis, if the statistician is orthodox. However, the engineer
> says he has another meter reading to 1000 volts, which he would have
> used if any voltage had been over 100. This is a relief to the
> statistician, because it means the population was effectively
> uncensored after all. But, the next day the engineer informs the
> statistician that this second meter was not working at the time of the
> measuring. The statistician ascertains that the engineer would not
> have held up the measurements until the meter was fixed, and informs
> him that new measurements are required. The engineer is astounded.

The story is obviously meant to be silly, but it’s not clear to me what liberties are being taken with the methodology it pokes fun at. I’m sure a busy applied statistician wouldn’t worry over this, but what about a hardcore academic frequentist?

Using a dogmatic frequentist approach, would we need to repeat the experiment? Could we draw any conclusions from the already available data?

To also address the more general point made by the story, if we want to make use of the data we already have, could the needed revision of hypothetical outcomes be made to fit in the frequentist framework?

In frequentist inference, we want to determine how frequently something would have happened if a given stochastic process were repeatedly realized. That is the starting point for the theory of p-values, confidence intervals, and the like. However, in many applied projects, the “given” process is not really given, and the statistician has to do at least some work specifying and modeling it. This can be a surprisingly ambiguous problem, as it is in this case.

# Modeling the Data Generation Process

Based on the information given, our best candidate for the data generation process (call it Process 1) seems to be the following:

1. If the 100V meter reads 100V, the engineer re-measures with the 1000V meter if it is operational. Otherwise, he simply marks 100V and moves on.

But isn’t this a bit unfair to our engineer? Assuming he is an engineer and not merely a technician, he probably understands why he needs to re-measure when the first meter reads 100V: the meter is saturated at the upper limit of its range, hence no longer reliable. So perhaps what the engineer would really do is the following (call it Process 2):

2. If the 100V meter reads 100V, the engineer re-measures with the 1000V meter if it is operational. Otherwise, he simply marks 100V, appends a plus sign to indicate the saturated measurement, and moves on.

Both of these processes are consistent with the data we have, but they are different processes, and they yield different confidence intervals. Process 2 is the one we would prefer as statisticians. If the voltages are often well above 100V, Process 1 has a potentially catastrophic failure mode in which the measurements are occasionally severely underestimated, because the data are censored without our knowing it. A confidence interval that remains valid under Process 1 has to allow for this possibility, so it widens accordingly. We could mitigate this by asking the engineer to tell us when his 1000V meter is not working, but this is really just another way of ensuring that our data conform to Process 2.
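To make the Process 1 failure mode concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption made for illustration, not something given in the story: voltages are taken to be normally distributed, the backup meter is assumed to work in half of the measuring sessions, and the function names are hypothetical. It estimates how often a textbook 95% interval, computed as if nothing were censored, covers the true mean when the data actually come from Process 1.

```python
import random
import statistics

LIMIT = 100.0  # top of the first meter's range


def record_process_1(voltage, backup_works):
    """Process 1: a saturated reading is re-measured only if the backup meter works."""
    if voltage < LIMIT or backup_works:
        return voltage
    return LIMIT  # silently censored: nothing marks this value as a lower bound


def naive_interval(values, z=1.96):
    """Normal-approximation 95% interval for the mean, ignoring any censoring."""
    m = statistics.fmean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return m - z * se, m + z * se


def coverage(true_mean, sd, n=50, p_backup_works=0.5, reps=2000, seed=0):
    """Fraction of simulated sessions whose naive interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Assumption: the backup meter is either up or down for a whole session.
        backup_works = rng.random() < p_backup_works
        sample = [record_process_1(rng.gauss(true_mean, sd), backup_works)
                  for _ in range(n)]
        lo, hi = naive_interval(sample)
        hits += lo <= true_mean <= hi
    return hits / reps


safe = coverage(true_mean=85.0, sd=5.0)   # voltages rarely reach the limit
risky = coverage(true_mean=98.0, sd=8.0)  # voltages often exceed the limit
print(f"coverage, mean far below limit: {safe:.3f}")
print(f"coverage, mean near limit:      {risky:.3f}")
```

Under these assumptions, coverage should stay close to the nominal level when voltages rarely reach 100V and fall well below it when they often do. That gap is what an interval valid under Process 1 would have to pay for with extra width.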

If the horse has already left the barn and we cannot determine when the measurements are and aren’t censored, we could try to infer from the data the times when the 1000V meter isn’t working. By introducing an inference rule into the process, we effectively create a new Process 1.5 distinct from both 1 and 2. Our inference rule would sometimes work and sometimes not, so the confidence interval from Process 1.5 would be intermediate in size compared to Processes 1 and 2.

In theory, there is nothing wrong or suspicious about a single statistic having three different confidence intervals associated with three different plausibly representative stochastic processes. In practice, few consumers of statistics want three different confidence intervals. They want one, the one that is based on what would have actually happened, had the experiment been repeated many times. So typically, the applied statistician considers the domain knowledge she has acquired during the project, makes an educated guess, and presents the confidence interval associated with the process she has guessed. Or she works with the customer to formalize the process, so there’s no need to guess going forward.

# How to Respond to New Information

Despite the insistence of the statistician in the story, frequentist inference does not require that we repeat measurements when we gain new information suggesting the generating stochastic process is not quite what we originally conceived. However, if the process is going to be repeated, we do need to ensure that all repetitions are consistent with the model process assumed by the confidence interval. We can do this by changing the process or by changing our model of it.

If we change the process, we may need to discard past data that were collected inconsistently with that process. But that isn’t an issue here, because all the process variations we’re considering differ only when a reading reaches the 100V limit, and that never happened in this case.

Whatever we do, model and reality must be brought into alignment. Only then will the theoretically guaranteed frequentist error rate be what the customer actually gets upon repeated performance of the process.

# The Bayesian Alternative

On the other hand, if all we really care about is the probable range of the true mean for this sample, we should cast aside frequentism entirely and seek out the people who sell the answer to that question – the Bayesians. If we go this route, all the haggling over counterfactuals becomes irrelevant; all that matters is the prior and likelihood. In exchange for this simplification, we lose any hope of guaranteeing an error rate under repeated performance of the “experiment”.
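One way to see why the counterfactuals drop out of a likelihood-based analysis is to compare the likelihood under a censoring model with the likelihood under a model with no censoring at all. The sketch below assumes a normal model for the voltages and uses made-up readings in the 75 to 99 volt range (the story does not give the individual measurements); the function names are hypothetical.

```python
from math import log
from statistics import NormalDist

LIMIT = 100.0  # top of the first meter's range


def loglik_uncensored(data, mu, sigma):
    """Log-likelihood if every reading is taken at face value."""
    dist = NormalDist(mu, sigma)
    return sum(log(dist.pdf(x)) for x in data)


def loglik_censored(data, mu, sigma):
    """Log-likelihood if a reading at LIMIT only tells us the voltage is >= LIMIT."""
    dist = NormalDist(mu, sigma)
    return sum(
        log(1 - dist.cdf(LIMIT)) if x >= LIMIT else log(dist.pdf(x))
        for x in data
    )


# Hypothetical readings, all below the limit as in the story.
readings = [75.0, 81.5, 88.0, 92.3, 96.1, 99.0]

for mu, sigma in [(85.0, 5.0), (90.0, 7.0), (95.0, 10.0)]:
    same = loglik_censored(readings, mu, sigma) == loglik_uncensored(readings, mu, sigma)
    print(mu, sigma, same)
```

Because no reading reached the limit, the two log-likelihoods agree for every parameter value tried, so any prior leads to the same posterior under either model. The two functions only part ways once a saturated reading appears in the data.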

# Why the Fuss?

This story was constructed to make it look like the frequentist statistician fusses over silly things for no reason. Honestly, who cares about these silly counterfactuals? The answer, of course, is that everyone should care. Vitally important scientific fields are currently suffering from a serious replication crisis, which suggests that false discoveries are much more frequent in the scientific literature than expected. One of the drivers of this crisis, although by no means the only one, is the rise of p-hacking, in which researchers try many variations of a model, controlling for different variables, until they get significance.

P-hacking has been extensively vilified in the popular scientific media and the blogosphere, but few actually understand what is wrong with p-hacking and why. Contrary to popular statistical opinion, there is nothing wrong with looking at your data before, during, and after the modeling process. What is wrong is failing to report exploratory analyses and how they influenced the course of the study. Only by looking at the full process can we hope to determine what stochastic model is representative of that process and what frequentist analysis is appropriate for that model, if any.
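The gap between the reported analysis and the full process can be illustrated with a small simulation. This is a deliberately simplified sketch: it stands in for "variations of a model" with independent outcome variables, uses a z-test with known unit variance, and generates pure noise so that every significant result is a false positive. The function names and settings are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def p_value(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_variations, n=30, reps=2000, alpha=0.05, seed=1):
    """Share of null studies that report significance when only the
    best of n_variations analyses is kept."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p_values = [
            p_value([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(n_variations)
        ]
        hits += min(p_values) < alpha
    return hits / reps


single = false_positive_rate(1)   # the process the reported p-value assumes
hacked = false_positive_rate(10)  # the process that was actually followed
print(f"one planned analysis:      {single:.3f}")
print(f"best of ten, unreported:   {hacked:.3f}")
```

With one planned analysis the false positive rate should sit near the nominal 5%. With ten independent tries it should be several times higher. The individual p-values are computed correctly in both cases; what changes is the stochastic process that produced the reported one.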

Claiming that a certain frequentist analysis is appropriate is a very serious claim. Making that claim implies that you are binding yourself to the discipline of the stochastic process you have chosen, which entails an entire system of counterfactuals about what you would have done in different situations. You have to actually conform to that system for the frequentist guarantee to apply to you. Very few researchers, especially those in fields that emphasize open-ended exploration, conform to the system, and they do not report their deviations scrupulously; that is why we now have a replication crisis on our hands. (Some respected researchers have argued that this expectation is unrealistic, a position I sympathize with, but that is getting beyond the scope of this post.)

> It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data. (Gelman & Loken, 2013)

In studies that are relatively simple and/or standardized, such as clinical trials, we can adjust for things like multiple or sequential comparisons and maintain the theoretical error rate; in more complex and exploratory studies, a frequentist model may be inapplicable because the researcher may not be fully conscious of all the decisions being made, let alone recording and presenting them explicitly. In such cases, the researcher should (1) be honest and upfront about what was done; (2) present p-values either with strong caveats, or not at all; (3) consider presenting other lines of evidence, such as prior plausibility of the hypothesis or a follow-up replication study.
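As a sketch of what adjusting for multiple comparisons can look like in the simple case, the snippet below applies a Bonferroni correction, which is one standard adjustment among several and is swapped in here purely for illustration. It relies on the fact that, for a continuous test statistic, a valid p-value is uniformly distributed under the null hypothesis, and it assumes the comparisons are independent. The function name and settings are hypothetical.

```python
import random


def familywise_error(k, threshold, reps=20000, seed=2):
    """Chance that at least one of k independent null p-values falls below threshold."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(k)) < threshold
        for _ in range(reps)
    )
    return hits / reps


alpha, k = 0.05, 10
unadjusted = familywise_error(k, alpha)      # theory: 1 - 0.95**10, about 0.40
bonferroni = familywise_error(k, alpha / k)  # theory: 1 - 0.995**10, about 0.049
print(f"unadjusted: {unadjusted:.3f}")
print(f"Bonferroni: {bonferroni:.3f}")
```

The correction only restores the error rate because the number of comparisons is known and fixed in advance, which is the situation in a pre-specified trial and not in open-ended exploration.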