# P-values equal to 0 in permutation test

I have two datasets and I would like to know if they are significantly different or not (this comes from “Two groups are significantly different? Test to use“).

I decided to use a permutation test, doing the following in R:

permutation.test <- function(coding, lncrna) {
coding <- coding[,1] # dataset1
lncrna <- lncrna[,1] # dataset2

### Under null hyphotesis, both datasets would be the same. So:
d <- c(coding, lncrna)

# Observed difference
diff.observed = mean(coding) - mean(lncrna)
number_of_permutations = 5000
diff.random = NULL

for (i in 1:number_of_permutations) {
# Sample from the combined dataset
a.random = sample (d, length(coding), TRUE)
b.random = sample (d, length(lncrna), TRUE)
# Null (permuated) difference
diff.random[i] = mean(b.random) - mean(a.random)
}

# P-value is the fraction of how many times the permuted difference is equal or more extreme than the observed difference
pvalue = sum(abs(diff.random) >= abs(diff.observed)) / number_of_permutations
pvalue
}

Nevertheless, p-values should not be 0 according to this paper: http://www.statsci.org/smyth/pubs/permp.pdf

What do you recommend me to do? Is this way to calculate the p-value:

pvalue = sum(abs(diff.random) >= abs(diff.observed)) / number_of_permutations

a good way? Or is it better doing the following?

pvalue = sum(abs(diff.random) >= abs(diff.observed)) + 1 / number_of_permutations + 1

### Discussion

A permutation test generates all relevant permutations of a dataset, computes a designated test statistic for each such permutation, and assesses the actual test statistic in the context of the resulting permutation distribution of the statistics. A common way to assess it is to report the proportion of statistics which are (in some sense) “as or more extreme” than actual statistic. This is often called a “p-value.”

Because the actual dataset is one of those permutations, its statistic will necessarily be among those found within permutation distribution. Therefore, the p-value can never be zero.

Unless the dataset is very small (less than about 20-30 total numbers, typically) or the test statistic has a particularly nice mathematical form, is not practicable to generate all the permutations. (An example where all permutations are generated appears at Permutation Test in R.) Therefore computer implementations of permutation tests typically sample from the permutation distribution. They do so by generating some independent random permutations and hope that the results are a representative sample of all the permutations.

Therefore, any numbers (such as a “p-value”) derived from such a sample are only estimators of the properties of the permutation distribution. It is quite possible–and often happens when effects are large–that the estimated p-value is zero. There is nothing wrong with that, but it immediately raises the heretofore neglected issue of how much could the estimated p-value differ from the correct one? Because the sampling distribution of a proportion (such as an estimated p-value) is Binomial, this uncertainty can be addressed with a Binomial confidence interval.

### Architecture

A well-constructed implementation will follow the discussion closely in all respects. It would begin with a routine to compute the test statistic, as this one to compare the means of two groups:

diff.means <- function(control, treatment) mean(treatment) - mean(control)

Write another routine to generate a random permutation of the dataset and apply the test statistic. The interface to this one allows the caller to supply the test statistic as an argument. It will compare the first m elements of an array (presumed to be a reference group) to the remaining elements (the “treatment” group).

f <- function(..., sample, m, statistic) {
s <- sample(sample)
statistic(s[1:m], s[-(1:m)])
}

The permutation test is carried out first by finding the statistic for the actual data (assumed here to be stored in two arrays control and treatment) and then finding statistics for many independent random permutations thereof:

z <- stat(control, treatment) # Test statistic for the observed data
sim<- sapply(1:1e4, f, sample=c(control,treatment), m=length(control), statistic=diff.means)

Now compute the binomial estimate of the p-value and a confidence interval for it. One method uses the built-in binconf procedure in the HMisc package:

require(Hmisc)                                    # Exports binconf
k <- sum(abs(sim) >= abs(z))                      # Two-tailed test
zapsmall(binconf(k, length(sim), method='exact')) # 95% CI by default

It’s not a bad idea to compare the result to another test, even if that is known not to be quite applicable: at least you might get an order of magnitude sense of where the result ought to lie. In this example (of comparing means), a Student t-test usually gives a good result anyway:

t.test(treatment, control)

This architecture is illustrated in a more complex situation, with working R code, at Test Whether Variables Follow the Same Distribution.

### Example

As a test, I generated $10$ normally distributed “control” values from a distribution with mean $0$ and $20$ normally distributed “treatment” values from a distribution with mean $1.5$.

set.seed(17)
control <- rnorm(10)
treatment <- rnorm(20, 1.5)

After using the preceding code to run a permutation test I plotted the sample of the permutation distribution along with a vertical red line to mark the actual statistic:

h <- hist(c(z, sim), plot=FALSE)
hist(sim, breaks=h\$breaks)
abline(v = stat(control, treatment), col="Red")

The Binomial confidence limit calculation resulted in

PointEst Lower        Upper
0     0 0.0003688199

In other words, the estimated p-value was exactly zero with a (default 95%) confidence interval from $0$ to $0.00037$. The Student t-test reports a p-value of 3.16e-05, which is consistent with this. This supports our more nuanced understanding that an estimated p-value of zero in this case corresponds to a very small p-value that we can legitimately take to be less than $0.00037$. That information, although uncertain, usually suffices to make a definite conclusion about the hypothesis test (because $0.00037$ is far below common thresholds of $0.05$, $0.01$, or $0.001$).

When $k$ out of $N$ values in the sample of the permutation distribution are considered “extreme,” then both $k/N$ and $(k+1)/(N+1)$ are reasonable estimates of the true p-value. (Other estimates are reasonable, too.) Normally there is little reason to prefer one to the other. If they lead to different decisions, that means $N$ is too small. Take a larger sample of the permutation distribution instead of fudging the way in which the p-value is estimated.
If greater precision in the estimate is needed, just run the permutation test longer. Because confidence interval widths typically scale inversely proportional to the square root of the sample size, to improve the confidence interval by a factor of $10$ I ran $10^2=100$ times as many permutations. This time the estimated p-value was $0.000005$ (five of the permutation results were at least as far from zero as the actual statistic) with a confidence interval from $1.6$ through $11.7$ parts per million: a little smaller than the Student t-test reported. Although the data were generated with normal random number generators, which would justify using the Student t-test, the permutation test results differ from the Student t-test results because the distributions within each group of observations are not perfectly normal.