I recently started reading the Bayesian criticism of the p-value, and it seems that there is a lot of discussion around the claim that a frequentist approach performs poorly when the null hypothesis is true.
For instance, in this paper the authors write that “p-values overstate the evidence against the null […] this does not have to do with type I or type II errors; it is an ‘independent’ property of the p-value.”
To illustrate this point, the authors show that when the null is true, the p-value has a uniform distribution.
What I do not get is that, even when the null is true, a frequentist approach, thanks to the Central Limit Theorem, is still able to construct confidence intervals that include 0 (non-significance) at the appropriate α level.
I do not get why the fact that the p-value is uniform when the null is true shows that a frequentist approach is biased. And what does “independent property of the p-value” mean?
```r
library(tidyverse)
library(broom)

n <- 1000
x <- rnorm(n, 100, 30)
d <- 0                       # true slope is 0: the null is exactly true
y <- x * d + rnorm(n, 0, 20)
df <- data.frame(y, x)

plot(x, y)
abline(lm(y ~ x), col = 'red')

# Draw 1000 subsamples of size 50 and fit a regression to each
r <- replicate(1000, sample_n(df, size = 50), simplify = FALSE)
m <- r %>% map(~ lm(y ~ x, data = .)) %>% map(tidy)

# Central Limit Theorem: sampling distribution of the slope estimate
bind_rows(m, .id = 'sample') %>%
  filter(term == 'x') %>%
  ggplot(aes(estimate)) +
  facet_grid(~term) +
  geom_histogram()

s <- bind_rows(m, .id = 'sample') %>% filter(term == 'x')
s$false_positive <- ifelse(s$p.value < 0.05, 1, 0)
prop.table(table(s$false_positive))  # false positive rate is near 5%

# Uniform distribution of the p-value under the null
hist(s$p.value, breaks = 50)
```
Answer
The point that the authors are trying to make is a subtle one: they see it as a failure of NHST that, as n grows arbitrarily large, the p-value does not tend to 1. It is a bit surprising that the paper contains no discussion of equivalence testing. To me it is fairly obvious and reasonable that the p-value keeps its uniform distribution when the null is true, no matter how large n becomes. A large n means having sensitivity to detect smaller and smaller effects, while the false positive error rate remains fixed at α. So, in the somewhat artificial setting where the null is exactly true, the behavior of the p-value distribution does not depend on n at all.
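This can be checked directly with a quick simulation (a sketch, not from the original answer; it uses one-sample t-tests of standard normal data, so the null of mean 0 is exactly true):

```r
# Sketch: under an exactly true null, the p-value stays uniform and the
# rejection rate stays near alpha = 0.05 no matter how large n is.
set.seed(1)
for (n in c(20, 200, 2000)) {
  p <- replicate(2000, t.test(rnorm(n))$p.value)
  cat(sprintf("n = %4d: rejection rate at alpha = 0.05: %.3f\n",
              n, mean(p < 0.05)))
}
```

Every rejection rate lands near 0.05, and `hist(p)` is flat at every n; only sensitivity to nonzero effects changes with n, not the null behavior.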

NHST is, in my mind, desirable precisely because there is no way of declaring a null hypothesis to be true: my experimental design is set up specifically to disprove it. A non-significant result may mean that my experiment was underpowered or that my assumptions were wrong, so there are risks associated with accepting the null that I would rather not incur.

We never actually believe that the null hypothesis is exactly true. Typically, failed designs arise because the truth is too close to the null to be detectable. Having too much data can even be a bad thing here; rather, there is a subtle art to designing a study with just enough sample size to reject the null when a meaningful difference is present.
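The "just enough sample size" point can be made concrete with a standard power calculation (a sketch using base R's `power.t.test`, not part of the original answer; the smallest effect of interest, 0.5 SD, is an assumed value for illustration):

```r
# Sketch: choose n just large enough to detect a *meaningful* difference.
# Assumption: the smallest effect of interest is delta = 0.5 SD.
design <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
ceiling(design$n)  # per-group sample size needed: 64
# A much larger n would also flag effects far smaller than 0.5 SD as
# "significant", even when they are practically meaningless.
```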

One can design a frequentist procedure that sequentially tests for differences (one- or two-tailed) and, conditional on a negative result, performs an equivalence test (declaring the null true as a significant result). In the latter case one can show that the power of the equivalence test goes to 1 when the null is in fact true.
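That last claim can be sketched with a simple two one-sided tests (TOST) procedure (my own sketch, not from the original answer; the equivalence margin of ±0.2 SD is an assumed value chosen for illustration):

```r
# Sketch: TOST equivalence test. With a true mean of 0 (the null is
# exactly true), the probability of declaring equivalence grows with n.
tost <- function(x, margin = 0.2, alpha = 0.05) {
  lower <- t.test(x, mu = -margin, alternative = "greater")$p.value
  upper <- t.test(x, mu =  margin, alternative = "less")$p.value
  max(lower, upper) < alpha  # both one-sided tests must reject
}

set.seed(1)
for (n in c(50, 200, 1000)) {
  hits <- mean(replicate(1000, tost(rnorm(n))))
  cat(sprintf("n = %4d: equivalence declared in %.0f%% of runs\n",
              n, 100 * hits))
}
```

Unlike the plain NHST p-value, which stays uniform, the equivalence test's rejection rate climbs toward 1 as n grows, so "the null is true" becomes a conclusion one can actually reach with enough data.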
Attribution
Source : Link , Question Author : giac , Answer Author : AdamO