# Three questions about the article “Ditch p-values. Use Bootstrap confidence intervals instead”

I am not a statistician by training, and students asked me to explain an article called “Ditch p-values. Use Bootstrap confidence intervals instead” to them. The author seems to be a prominent academic; however, I am confused about some of the material there. Please ignore this post if it seems too long. I have cut it down to just three questions, and I will infer the other answers from these.

Let’s take a simplified but revealing example: we want to determine Robert’s citizenship. Null hypothesis: H0, Robert is a US citizen. Alternative hypothesis: H1, he is not. Our data: we know that Robert is a US senator. There are 100 senators out of 330 million US citizens, so under the null hypothesis, the probability of our data (i.e., the p-value) is 100 / 330,000,000 ≈ 0.000000303. Per the rules of statistical significance, we can safely conclude that our null hypothesis is rejected and Robert is not a US citizen.

Am I right that this is not a p-value (which is the probability of observing a value of a test statistic at least as extreme as the one seen)? Is this a correct statistical testing procedure? I have a gut feeling that this is the wrong situation in which to apply hypothesis testing, but I cannot formally say why.

P-values were invented at a time when all calculations had to be done by hand, and so they rely on simplifying statistical assumptions. Broadly speaking, they assume that the phenomenon you’re observing obeys some regular statistical distribution.

This seems wrong to me, but my question is: can we say that non-parametric tests also rely on some regular statistical distributions? Not only do they have assumptions, but, technically, their test statistics also follow some distribution.

Let’s say that a business decision-maker is pondering two possible actions, A and B. Based on observed data, the probability of zero or negative benefits is:

0.08 for action A

0.001 for action B

Should the decision-maker pick action B based on these numbers? What if I told you that the corresponding 90% confidence intervals are:

[-0.5m; 99.5m] for action A

[0.1m; 0.2m] for action B

Action B may have a lower probability of leading to a zero or negative outcome, but its expected value for the business is much lower, unless the business is incredibly risk-averse.

Can we, based on confidence intervals, say what the expected value is? Is there a clear decision in this situation? I always thought that confidence intervals are not necessarily symmetric, but this passage made me start to doubt that.

### 1 They don’t mean what people think they mean

Am I right that this is not a p-value (which is the probability of observing a value of a test statistic at least as extreme as the one seen)? Is this a correct statistical testing procedure? I have a gut feeling that this is the wrong situation in which to apply hypothesis testing, but I cannot formally say why.

One could argue that, technically speaking, it is a p-value. But it is a rather meaningless p-value. There are two ways to see why it is meaningless:

• Neyman and Pearson suggest that, in order to compute the p-value, you choose the rejection region where the likelihood ratio (between the null hypothesis and the alternative hypothesis) is highest. You count observations as ‘extreme’ when a deviation from the null hypothesis would make such an extreme observation more likely.

This is not the case in the US-citizen example. If the null hypothesis ‘Robert is a US citizen’ is false, the observation ‘Robert is a US senator’ is in no way more likely. So from the viewpoint of the Neyman–Pearson approach to hypothesis testing, this is a very bad calculation of a p-value.

• From the viewpoint of Fisher’s approach to hypothesis testing, you have a measurement of some effect, and the point of the p-value is to quantify its statistical significance. It is useful as an expression of the precision of an experiment.

The p-value quantifies how well the experiment pins down the deviation. Statistically speaking, apparent effects will always occur to some extent due to random fluctuations in the measurements. An observation is seen as statistically significant when the fluctuation is sufficiently large that there is only a low probability of observing a seeming effect when there is actually no effect (when the null hypothesis is true). Experiments with a high probability of showing an effect when there is actually none are not very useful. We use p-values to express this probability.

By reporting p-values, researchers can show that their experiments have sufficiently small noise and a sufficiently large sample size that the observed effects are statistically significant (unlikely to be just noise).

Fisher’s p-values are an expression of the noise and random fluctuations; they are a sort of expression of the signal-to-noise ratio. The advice is to only reject a hypothesis when an effect is sufficiently large compared to the noise level.
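A quick simulation can illustrate this signal-to-noise idea (all numbers here are made up for illustration): for a fixed true effect and fixed noise level, a larger sample size makes a one-sample t-test's p-value shrink, which is exactly what "sufficiently small noise and sufficiently large sample size" buys you.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
effect = 0.3  # fixed, modest true effect (illustrative value)

# Same effect, same noise level; only the sample size changes.
pvals = {}
for n in (10, 100, 1000):
    x = rng.normal(loc=effect, scale=1.0, size=n)
    pvals[n] = stats.ttest_1samp(x, 0.0).pvalue

print(pvals)  # p-values generally shrink as n grows
```

With n = 10 the true effect is easily drowned in noise; by n = 1000 it is unmistakable.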

Even though there is no alternative hypothesis in Fisher’s viewpoint, when we express a p-value we do so for the measurement of some effect, as a deviation relative to a null (no-effect) hypothesis. There must be some sense of a direction that can be considered an effect or a deviation.

In the US-citizenship example, the observation ‘Robert is a US senator’ has nothing to do with the measurement of some effect or of a deviation from the null hypothesis. Expressing a p-value for it is meaningless.

The US-citizenship example may seem a bit weird and wrong, but it is not meant to be correct. Its point is to show that a p-value by itself is not very meaningful. What we also need to consider is the power of a test (and that is what is missing in the US-citizenship example). A low p-value might be nice, but what if the p-value would be just as low, or even lower, under an alternative explanation? With a bad hypothesis test we could ‘reject a hypothesis’ based on a (crappy) low p-value while actually no alternative hypothesis is any more suitable.

Example 1: Say you have two jars, one with 50% gold and 50% silver coins, the other with 75% gold and 25% silver coins. You take 10 coins out of one jar and they are all silver; which jar do we have? We could say that the prior odds were 1:1 and the posterior odds are 1024:1 in favour of the 50:50 jar. So we can say that the jar is very likely the one with 50:50 gold:silver, but both hypotheses are unlikely to produce 10 silver coins, and maybe we should mistrust our model.
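The arithmetic of this example can be checked directly. A minimal sketch, treating the 10 draws as independent with the stated silver fractions:

```python
# Likelihood of drawing 10 silver coins in a row from each jar
lik_5050 = 0.5 ** 10    # 50% silver jar: 1/1024
lik_7525 = 0.25 ** 10   # 25% silver jar: 1/1048576

# With 1:1 prior odds, the posterior odds equal the likelihood ratio
posterior_odds = lik_5050 / lik_7525
print(posterior_odds)  # 1024.0, in favour of the 50:50 jar

# Yet both likelihoods are tiny: the data are surprising under
# either hypothesis, which is a reason to mistrust the model itself.
print(lik_5050, lik_7525)
```

Both likelihoods being below one in a thousand is the formal version of "maybe we should mistrust our model".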

Example 2: Say you have data that follows a quadratic curve y = a + c·x², but you fit it with a straight line y = a + b·x. When we fit this model we find an extremely low p-value for a zero slope (no effect), since the data do not match a flat line (they follow a quadratic curve). But does that mean we should reject the hypothesis that the coefficient b is zero? The discrepancy, the low p-value, arises not because the null hypothesis is false but because our entire model is false (and that is the actual conclusion when the p-value is low: the null hypothesis and/or the statistical model is false).
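A sketch of this mis-specification effect with simulated data (illustrative parameter values; note the x-range is chosen one-sided so that the quadratic trend shows up as an apparent slope):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 200)                             # one-sided range
y = 1.0 + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)  # y = a + c*x^2

# Fit the wrong model, y = a + b*x, and test H0: b = 0
fit = stats.linregress(x, y)
print(fit.slope, fit.pvalue)  # slope far from 0, p-value essentially 0

# The tiny p-value does not mean a straight line with b != 0 is right;
# it means the flat-line null *and* the linear model are both wrong.
```

Plotting the residuals of this fit would reveal the leftover quadratic pattern that the p-value alone never shows.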

### 2 They rely on hidden assumptions

This seems wrong to me, but my question is: can we say that non-parametric tests also rely on some regular statistical distributions? Not only do they have assumptions, but, technically, their test statistics also follow some distribution.

The point of non-parametric tests is that we make no assumptions about the distribution of the data. But the statistic that we compute may still follow some known distribution.

Example: We wonder whether one sample tends to be larger than another. Let’s say that the samples are paired. Then, without knowing anything about the distribution, we can simply count in how many pairs one value is larger. Independent of the distribution of the population from which the sample was taken, this sign statistic will follow a binomial distribution.
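A minimal sketch of this sign test (the paired numbers below are made up purely for illustration):

```python
import math

# Hypothetical paired measurements: is B systematically larger than A?
a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 6.1, 5.2]
b = [5.6, 5.0, 6.4, 5.4, 5.3, 6.0, 6.5, 5.8]

wins = sum(y > x for x, y in zip(a, b))  # pairs where B is larger
n = len(a)                               # no ties in this toy data

# Under H0 each pair is a fair coin flip, so `wins` ~ Binomial(n, 1/2)
# no matter what distribution the measurements come from.
p_value = sum(math.comb(n, k) for k in range(wins, n + 1)) / 2**n
print(wins, p_value)  # 7 wins out of 8, one-sided p = 9/256 ≈ 0.035
```

The binomial tail calculation uses only the count of signs, never the measurement values themselves, which is exactly why no distributional assumption about the data is needed.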

So the point of non-parametric tests is not that the statistic being computed has no distribution, but that the distribution of the statistic is independent of the distribution of the data.

The point of “They rely on hidden assumptions” is correct. However, it is a bit harsh and presents the assumptions in a limited sense (as if assumptions were only simplifications to make computations easier).

Indeed, many models are simplifications. But I would say that parametric distributions are still useful, even now that we have much more computing power and simplifications are no longer necessary. The reason is that parametric distributions are not always mere simplifications.

• On the one hand: bootstrapping or other simulations can approach the same result as an analytic computation, and when that computation makes assumptions, approximations and simplifications, the bootstrap may even do better.

• On the other hand: the parametric distribution, if it is true, gives you information that bootstrapping can’t. When you have only a small amount of data, you can’t get a proper estimate of p-values or confidence intervals; with parametric distributions you can fill the gap.

Example: if you have ten samples from a distribution, you might estimate the quantiles at multiples of 10%, but you won’t be able to estimate smaller quantiles. If you know that the distribution can be approximated by some parametric family (based on theory and previous knowledge, such assumptions need not be bad), then you can fit that parametric distribution and use it to interpolate and extrapolate beyond the ten samples to other quantiles.
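A sketch of this interpolation idea, assuming (purely for illustration) that the data are known from theory to be roughly normal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=10)  # only ten observations

# Empirically, the 1% quantile of 10 points is pinned near the minimum:
empirical_q01 = np.quantile(sample, 0.01)

# Parametrically, *assume* normality, fit mean and sd, and read off
# the 1% quantile from the fitted distribution instead:
mu, sigma = sample.mean(), sample.std(ddof=1)
parametric_q01 = norm.ppf(0.01, loc=mu, scale=sigma)

print(empirical_q01, parametric_q01)
```

The empirical estimate can never reach below the smallest observation, while the parametric fit extrapolates into the tail; of course the extrapolation is only as good as the normality assumption.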

Example 2: Representing parametric tests as useful only for making calculations easier is a straw man argument; that is far from the only reason to use them. The main reason people use parametric tests is that they are more powerful. Compare, for instance, the parametric t-test with the non-parametric Mann–Whitney U test: the former is chosen not because the computation is easier, but because it can be more powerful.
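A quick Monte Carlo sketch of this power comparison (illustrative sample size and effect size; with normally distributed data the t-test typically rejects slightly more often than the Mann–Whitney U test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha, shift = 10, 2000, 0.05, 1.0

t_hits = mw_hits = 0
for _ in range(reps):
    a = rng.normal(size=n)
    b = rng.normal(loc=shift, size=n)  # true shift of one sd
    t_hits += stats.ttest_ind(a, b).pvalue < alpha
    mw_hits += stats.mannwhitneyu(a, b).pvalue < alpha

# Estimated power of each test at this sample size and effect
print(t_hits / reps, mw_hits / reps)
```

The gap is small here (the Mann–Whitney test has high relative efficiency under normality), and under heavy-tailed data the ordering can reverse, which is precisely the trade-off between the two tests.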

### 3 They detract from the real questions

Can we, based on confidence intervals, say what the expected value is? Is there a clear decision in this situation? I always thought that confidence intervals are not necessarily symmetric, but this passage made me start to doubt that.

No, confidence intervals do not give full information. You should instead compute some cost function that quantifies all the considerations in the decision (which requires the full distribution).

But confidence intervals can be a reasonable indication. The step from a single point estimate to a range is a big difference and adds an entirely new dimension to the representation.
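On the asker's symmetry doubt: confidence intervals indeed need not be symmetric. A percentile bootstrap interval, for instance, inherits the skew of the resampled statistic. A minimal sketch with simulated skewed data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=50)   # strongly skewed data

# Percentile bootstrap: resample with replacement, recompute the
# mean each time, then take quantiles of the bootstrap means.
boots = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.quantile(boots, [0.05, 0.95])    # 90% interval
print(data.mean(), (lo, hi))
# The interval need not be symmetric about the sample mean.
```

This is the kind of interval the blog post advocates; it conveys a range, but, as argued above, still not the full distribution a cost-function analysis would need.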

Your criticism here is also exactly the point of the author of the blog post. You criticize the confidence intervals for not giving full information, but the probabilities 0.08 for action A and 0.001 for action B carry even less information than the confidence intervals, and that is what the author is pointing out.

This third point is more a matter of point estimates versus interval estimates. Maybe p-values promote the use of point estimates, but it is a bit far-fetched to use that as a criticism of p-values. The example is not even a case about p-values; it is about a Bayesian posterior for two situations.