# Why is the number of false positives independent of sample size, if we use p-values to compare two independent datasets?

If I run the R code below, it will generate two independent vectors then test them to see if they are related in some way (i.e. p-value < 0.05).

If I repeat this 1,000 times, then about 50 of them (5%) will be false positives, with a p-value < 0.05. This is a Type I error.

If we increase sampleSize to 1,000, or even 100,000, the result is the same (about 5% false positives).

I am struggling to understand this, as I would have thought that if we collected enough samples, the chance of a false positive would drop towards 0 (just as a sample correlation converges to its true value of 0).

So, I guess my question is: “How can the number of false positives, based on p-values from comparing two independent datasets, be independent of sample size?”

```r
# R code to demonstrate that with a large dataset, we can still
# get significant p-values, purely by chance.

# Change this to 1000, and we still get roughly the same number of
# false positives (about 50, or 5%).
sampleSize <- 20

cat("Sample size:", sampleSize, "\n")

set.seed(1010093)
n <- 1000
pValues <- rep(NA, n)
for (i in 1:n) {
  y <- rnorm(sampleSize)   # two independent vectors: no true relationship
  x <- rnorm(sampleSize)
  pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]  # p-value for the slope
}
# Count false positives at the 0.05 level
fp <- sum(pValues < 0.05)
cat("Out of ", n, " tests, ", fp, " had a p-value < 0.05, purely by chance.\n", sep = "")
```

----output----

```
Running "true positives - none.R" ...
Sample size: 1000
Out of 1000 tests, 52 had a p-value < 0.05, purely by chance.
```


I think this question stems from a fundamental confusion about how the Neyman-Pearson paradigm for statistical hypothesis testing works, but it is a very interesting one. The central analogy I will use is the idea of absolute vs. relative references in computer science, which will be most familiar to people from writing formulas in Excel. (There is a quick guide to this here.) The idea is that something can be fixed to a given position on an absolute scale, or can be specified only relative to something else; as a result, if the ‘something else’ changes, the latter changes with it, but the former remains the same.

The central concept in hypothesis testing, on which everything else is built, is that of a sampling distribution: how a sample statistic (like a sample slope) would bounce around if an otherwise identical study were conducted over and over, ad infinitum. (For additional reference, I have discussed this here and here.) For statistical inference you need to know three things about the sampling distribution: its shape, its mean, and its standard deviation (called the standard error). Given some standard assumptions, if the errors are normally distributed, the sampling distribution of the slope of a regression model will be normally distributed, centered on the true value, with $SE=\sqrt{s^2/\Sigma(x_i-\bar x)^2}$. (If the residual variance, $s^2$, is estimated from your data, the sampling distribution will be $t_{df=N-(1+p)}$, where $N$ is the number of data points you have and $p$ is the number of predictor variables.)
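As a quick check (this simulation is my own sketch, not part of the original discussion), we can hold the predictor fixed, simulate the null over and over, and compare the empirical spread of the sample slopes to the analytic standard error $\sqrt{s^2/\Sigma(x_i-\bar x)^2}$ with $s^2 = 1$ known:

```r
# Sketch: the sampling distribution of a regression slope under the null.
# Fixing x across replications makes the analytic SE a single number.
set.seed(1)
N <- 10
x <- rnorm(N)                      # fix the predictor across replications
reps <- 5000
slopes <- replicate(reps, coef(lm(rnorm(N) ~ x))[2])  # true slope is 0
analytic_se <- sqrt(1 / sum((x - mean(x))^2))         # s^2 = 1 is known here
cat("empirical SD of the slopes: ", round(sd(slopes), 4), "\n")
cat("analytic SE from the formula:", round(analytic_se, 4), "\n")
```

The two numbers agree to within simulation error, and both shrink as $N$ grows.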

In the Neyman-Pearson version of hypothesis testing, we have a reference, or null, value for a sample statistic, and an alternative value. For instance, when assessing the relationship between two variables, the null value of the slope is typically 0, because that would mean there is no relationship between the variables, which is often an important possibility to rule out for our theoretical understanding of a topic. The alternative value can be anything: it might be a value posited by some theory, the smallest value that someone would care about from a practical standpoint, or something else. Let’s say that the null and alternative hypotheses regarding the true value of the slope of the relationship between $X$ and $Y$ in the population are 0 and 1, respectively. These numbers refer to an absolute scale: no matter what you choose for $\alpha$, $\beta$ / power, $N$, etc., they will remain the same. If we stipulate some values ($\alpha=.05$, $s^2=1$, $\text{Var}(X)=1$, & $N=10$), we can calculate things like what the sampling distributions would look like under the null and alternative hypotheses, or how much power the test would have.
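For concreteness, here is a sketch of that power calculation under the stipulated values, using a normal approximation for simplicity (the exact figures here are my illustration, not claims from the text; I treat $\text{Var}(X)$ as the population variance of the predictor, so $\Sigma(x_i-\bar x)^2 = N\,\text{Var}(X)$):

```r
# Sketch: approximate power of the two-sided slope test under the
# stipulated values, using the normal approximation.
alpha  <- 0.05
s2     <- 1      # residual variance
varX   <- 1      # variance of the predictor (treated as population variance)
N      <- 10
b_null <- 0
b_alt  <- 1
se     <- sqrt(s2 / (N * varX))     # SE of the slope under these assumptions
z_crit <- qnorm(1 - alpha / 2)
shift  <- (b_alt - b_null) / se     # how many SEs the alternative sits from the null
power  <- pnorm(shift - z_crit) + pnorm(-shift - z_crit)
cat("SE of the slope:  ", round(se, 3), "\n")
cat("approximate power:", round(power, 3), "\n")
```

With the exact $t$ distribution the numbers change slightly, but the logic is the same.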

Now, how should we decide whether to reject the null hypothesis under this scenario? There are at least two ways: we could check whether our p-value is less than $\alpha$, or we could check whether our estimated slope is greater than the absolute numerical value that corresponds to those numbers in this situation. The key thing to realize is that the former criterion is relative to the sampling distribution under the null hypothesis, whereas the latter is an absolute position on the number line. If we recalculated the sampling distributions with $N = 25$, they would look different (i.e., they would have narrower standard deviations), but anything defined relative to the null sampling distribution would keep the same relationship to it, because it is defined that way. That is, the upper 2.5% of the null sampling distribution would still comprise 2.5% of the total area under the curve, even though the line demarcating it would have moved relative to the absolute numerical scale underneath. On the other hand, if we rejected the null only when our estimated slope exceeded the fixed value we calculated above, we would become less and less likely to reject the null as we kept that value in place and continually increased $N$.

Consider the figure below. The alpha threshold is defined as the point that demarcates the outermost 5% of the area under the curve (here I have displayed only the upper tail; the lower tail works the same way). When $N = 10$, this happens to fall at $X = .735$ (given the values we stipulated above). When $N$ increases to $25$, the standard error shrinks and the sampling distribution becomes ‘narrower’. Because $\alpha = .05$ is defined relative to the sampling distribution, it shifts inward along with the rest of the sampling distribution. The corresponding value of the sample slope becomes $.417$. If the threshold had stayed in the same place on the absolute scale ($.735$), the rate of false positives would have fallen to $.00056$.

Note that this latter approach to hypothesis testing, comparing the observed value of the sample slope to a fixed cutoff point, is very much not how hypothesis testing is actually done, but I believe it is the basis for your confusion.