Welch’s t-test gives worse p-value for more extreme difference

Here are four different sets of numbers:

A = {95.47, 87.90, 99.00}
B = {79.2, 75.3, 66.3}
C = {38.4, 40.4, 32.8}
D = {1.8, 1.2, 1.1}

Using a two-sample t-test without assuming equal variances, I compare B, C, and D to A and get the following p-values:

0.015827 (A vs B)
0.000283 (A vs C)
0.001190 (A vs D)
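
These numbers can be reproduced with SciPy (a sketch; `ttest_ind` with `equal_var=False` performs Welch's t-test):

```python
from scipy import stats

A = [95.47, 87.90, 99.00]
B = [79.2, 75.3, 66.3]
C = [38.4, 40.4, 32.8]
D = [1.8, 1.2, 1.1]

# equal_var=False requests Welch's t-test (no equal-variance assumption)
pvals = {}
for label, group in [("B", B), ("C", C), ("D", D)]:
    t, p = stats.ttest_ind(A, group, equal_var=False)
    pvals[label] = p
    print(f"A vs {label}: t = {t:.3f}, p = {p:.6f}")
```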

I find it strange that the p-value from the A-D test is worse (larger) than that from the A-C test: the difference between the means is clearly much bigger AND the variance of D is much lower than the variance of C. Intuitively (to my intuition, at least), both of these facts should drive the p-value lower.

Could someone explain whether this is expected behavior of the t-test, or whether it has more to do with my particular data set (extremely low sample size, perhaps)? Is the t-test inappropriate for this particular set of data?

From a purely computational point of view, the reason for a worse p-value seems to be the degrees of freedom, which in the A-D comparison is 2.018 while it is 3.566 in the A-C comparison. But surely, if you just saw those numbers, wouldn’t you think that there is stronger evidence for rejecting the null hypothesis in the A-D case compared to A-C?
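
The Welch–Satterthwaite degrees of freedom can be computed directly from the sample variances (a minimal sketch; `welch_df` is a hypothetical helper name):

```python
import numpy as np

def welch_df(x, y):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    x, y = np.asarray(x), np.asarray(y)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    return (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

A = [95.47, 87.90, 99.00]
C = [38.4, 40.4, 32.8]
D = [1.8, 1.2, 1.1]

print(welch_df(A, C))  # ~ 3.566
print(welch_df(A, D))  # ~ 2.018
```

Because the variance of D is tiny relative to that of A, the formula collapses toward the smaller group's df (n − 1 = 2), so the reference t-distribution for A-D has much heavier tails than the one for A-C.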

Some might suggest that this is not a problem here since all p-values are quite low anyway. My problem is that these 3 tests are part of a suite of tests that I am performing. After correcting for multiple testing, the A-D comparison doesn’t make the cut, while the A-C comparison does. Imagine plotting those numbers (say bar-plots with error bars as biologists often do) and trying to justify why C is significantly different from A but D is not… well, I can’t.

Update: why this is really important

Let me clarify why this observation could have a great impact on interpreting past studies. In bioinformatics, I have seen the t-test applied to small sample sizes on a large scale (think differential gene expression of hundreds or thousands of genes, or the effect of many different drugs on a cell line, using only 3-5 replicates). The usual procedure is to do many t-tests (one for each gene or drug) followed by multiple testing correction, usually FDR. Given the above observation about the behavior of Welch's t-test, this means that some of the very best cases are being systematically filtered out. Although most people will look at the actual data for the comparisons at the top of their list (the ones with the best p-values), I don't know of anyone who will look through the list of all comparisons where the null hypothesis wasn't rejected.
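
To make the multiple-testing point concrete, here is a minimal sketch of the Benjamini–Hochberg step-up procedure (one common FDR correction; this is illustrative, not the questioner's actual pipeline):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array: which hypotheses are rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    # Compare each sorted p-value p_(i) to its threshold i/m * alpha
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i/m * alpha
        reject[order[: k + 1]] = True
    return reject

# With only these three comparisons all survive, but in a suite of hundreds
# of tests a p-value around 0.001 can easily fall below the cut.
reject = benjamini_hochberg([0.015827, 0.000283, 0.001190])
print(reject)
```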


Yes, it’s the degrees of freedom. The t-statistics themselves increase as we compare groups B, C, and D to A: the numerators get bigger and the denominators get smaller.

Why doesn’t your approach “work”? Well, the Satterthwaite approximation for the degrees of freedom, and hence the reference distribution, is (as the name suggests!) just an approximation. It would work fine if you had more samples in each group and data that are not hugely heavy-tailed; 3 observations per group is really very small for most purposes. (Also, while p-values are useful for doing tests, they don’t measure evidence and don’t estimate parameters with direct interpretations in terms of data.)
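
One way to see that the approximation is imperfect at this sample size is a quick null simulation (a sketch, assuming Normal data; the variance ratio and simulation count here are arbitrary choices): with 3 observations per group and very unequal variances, the achieved type I error rate need not equal the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n, alpha = 20000, 3, 0.05

# Simulate the null: both groups have mean 0, but very unequal variances
x = rng.normal(0, 1, size=(n_sim, n))
y = rng.normal(0, 10, size=(n_sim, n))
_, p = stats.ttest_ind(x, y, equal_var=False, axis=1)

rate = (p < alpha).mean()
print(rate)  # would be exactly 0.05 if the approximation were exact
```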

If you really want to work out the exact distribution of the test statistic (and a better-calibrated p-value), there are methods cited here that could be used. However, they rely on assuming Normality, an assumption you have no appreciable ability to check here.

Source: Link, Question Author: ALiX, Answer Author: guest
