High variance of the distribution of p-values (an argument in Taleb 2016)

I’m trying to understand the big-picture claim made in Taleb (2016), The Meta-Distribution of Standard P-Values.

In it, Taleb makes the following argument for the unreliability of the p-value (as I understand it):

An estimation procedure operating on $n$ data points drawn from some distribution $X$ outputs a p-value. If we repeatedly draw $n$ more points from this distribution and output another p-value each time, we can average these p-values, obtaining in the limit the so-called “true p-value”.

The distribution of p-values around this “true p-value” is shown to have disturbingly high variance, so that a distribution-plus-procedure with “true p-value” $0.12$ will, 60% of the time, report a p-value below $0.05$.
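The variability being described is easy to see in simulation. Here is a minimal sketch with a one-sample t-test; the sample size and effect size are my own illustrative choices, not Taleb’s exact setup, so the specific numbers will differ from the $0.12$/60% figures quoted above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative choices (not Taleb's exact setup): a one-sample t-test of
# H0: mean = 0, with data actually drawn from N(0.4, 1).
n, effect, trials = 30, 0.4, 10_000

pvals = np.empty(trials)
for i in range(trials):
    x = rng.normal(loc=effect, scale=1.0, size=n)
    pvals[i] = stats.ttest_1samp(x, 0.0).pvalue

# The same procedure on fresh draws from the same distribution yields
# wildly different p-values from run to run.
print(f"median p-value:           {np.median(pvals):.3f}")
print(f"fraction with p < 0.05:   {np.mean(pvals < 0.05):.2f}")
print(f"90% range of p-values:    "
      f"[{np.quantile(pvals, 0.05):.4f}, {np.quantile(pvals, 0.95):.3f}]")
```

Even though every run uses the same distribution and the same procedure, the realized p-values span several orders of magnitude.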

Question: how can this be reconciled with the traditional argument in favor of the p-value? As I understand it, the p-value is supposed to tell you what percentage of the time your procedure will give you the correct interval (or whatever). However, this paper seems to argue that this interpretation is misleading, since the p-value will not be the same if you run the procedure again.

Am I missing the point?

A p-value is a random variable.

Under $H_0$ (at least for a continuously distributed test statistic), the p-value has a uniform distribution on $[0, 1]$.
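This uniformity is easy to check by simulation. A minimal sketch (the choice of a one-sample t-test on standard normal data is mine, but any continuous test statistic with a true null would do):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw from N(0, 1), so H0: mean = 0 is true.
trials, n = 10_000, 25
pvals = np.array([stats.ttest_1samp(rng.normal(size=n), 0.0).pvalue
                  for _ in range(trials)])

# Under H0 with a continuous statistic, p is Uniform(0, 1):
# each decile bin should hold roughly 10% of the p-values.
hist, _ = np.histogram(pvals, bins=10, range=(0.0, 1.0))
print(hist / trials)
```

In particular, the uniform distribution is why rejecting at $p < \alpha$ controls the type I error rate at exactly $\alpha$.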

For a consistent test, under $H_1$ the p-value should go to 0 in the limit as the sample size increases toward infinity. Similarly, as effect sizes increase, the distribution of p-values should also tend to shift toward 0, but it will always be “spread out”.

The notion of a “true” p-value sounds like nonsense to me. What would it mean, under either $H_0$ or $H_1$? You might, for example, say that you mean “the mean of the distribution of p-values at some given effect size and sample size”, but then in what sense is there convergence where the spread should shrink? It’s not as if you can increase the sample size while holding it constant.

Here’s an example with one-sample t-tests and a small effect size under $H_1$. The p-values are nearly uniform when the sample size is small, and the distribution slowly concentrates toward 0 as the sample size increases.
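A sketch of that example in code (the exact effect size behind the original figure isn’t stated; here I pick one so that the power at $n = 1000$ comes out near the 57% quoted below, which is my own calibration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Effect size chosen (by me) so the n=1000 test has power near 57%.
effect, trials = 0.068, 2_000

for n in (10, 100, 1000):
    pvals = np.array([stats.ttest_1samp(rng.normal(loc=effect, size=n),
                                        0.0).pvalue
                      for _ in range(trials)])
    print(f"n={n:5d}  P(p < 0.05) = {np.mean(pvals < 0.05):.2f}  "
          f"median p = {np.median(pvals):.3f}")
```

At $n = 10$ the p-values are close to uniform (rejection rate barely above $\alpha$); by $n = 1000$ the distribution has shifted well toward 0, yet p-values near 1 still occur.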

This is exactly how p-values are supposed to behave: for a false null, as the sample size increases, the p-values should become more concentrated at low values. But there’s nothing to suggest that the distribution of the values the p-value takes when you make a type II error – when the p-value is above whatever your significance level is – should somehow end up “close” to that significance level.

What, then, would a p-value be an estimate of? It’s not like it’s converging to anything (other than to 0). It’s not at all clear why one would expect a p-value to have low variance anywhere except as it approaches 0, even when the power is quite good (e.g. for $\alpha = 0.05$, the power in the $n = 1000$ case above is close to 57%, but it’s still perfectly possible to get a p-value way up near 1).

It’s often helpful to consider both the distribution of your test statistic under the alternative and what applying the null cdf as a transformation does to that distribution (that gives the distribution of the p-value under the specific alternative). When you think in these terms, it’s often not hard to see why the behavior is as it is.
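For the one-sample t-test this can be done exactly: under the alternative the statistic is noncentral t, and the two-sided p-value is $p = 2(1 - F_0(|T|))$ where $F_0$ is the central-t cdf. A sketch (the sample size and effect size are again illustrative choices of mine):

```python
import numpy as np
from scipy import stats

# One-sample t-test with n observations and true mean `effect`
# (illustrative values).
n, effect = 30, 0.4
df, ncp = n - 1, effect * np.sqrt(n)

def p_value_cdf(a):
    """P(p <= a) under the alternative.

    Since p = 2 * (1 - F0(|T|)) with F0 the central-t cdf, the event
    {p <= a} is {|T| >= t_{1 - a/2}}, and T is noncentral t under H1.
    """
    t_crit = stats.t.ppf(1 - a / 2, df)
    nct = stats.nct(df, ncp)
    return nct.sf(t_crit) + nct.cdf(-t_crit)

# Evaluated at the significance level, this cdf is exactly the power.
print(f"power at alpha = 0.05: {p_value_cdf(0.05):.3f}")
print(f"P(p <= 0.5):           {p_value_cdf(0.5):.3f}")
```

Plotting `p_value_cdf` over $[0, 1]$ shows the whole distribution of the p-value under this alternative, with no simulation noise.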

The issue, as I see it, is not that there’s any inherent problem with p-values or hypothesis testing. It’s a question of whether the hypothesis test is a good tool for your particular problem, or whether something else would be more appropriate in a given case. That’s not a situation for broad-brush polemics, but for careful consideration of the kinds of questions hypothesis tests address and the particular needs of your circumstance. Unfortunately, such careful consideration is rarely made; all too often one sees a question of the form “what test do I use for these data?” without any consideration of what the question of interest might be, let alone whether a hypothesis test is a good way to address it.

One difficulty is that hypothesis tests are both widely misunderstood and widely misused; people very often think they tell us things that they don’t. The p-value is possibly the single most misunderstood thing about hypothesis tests.