Test whether variables follow the same distribution

If you want to test whether two variables follow the same distribution, would it be a good test to simply sort both variables, and then check their correlation? If it is high (at least 0.9?), then the variables most likely come from the same distribution.

By “distribution” here I mean “normal”, “chi-square”, “gamma”, etc.
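In code, the proposed procedure might look like this (a minimal sketch in R; the sizes, distributions, and the 0.9 cutoff are only illustrative):

x <- rnorm(100)                # two samples to compare
y <- rnorm(100)
r <- cor(sort(x), sort(y))     # correlation of the sorted values
r >= 0.9                       # proposed rule: TRUE means "same distribution"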

Answer

Let’s find out whether this is a good test or not. There’s a lot more to it than just claiming it’s bad or showing in one instance that it doesn’t work well. Most tests work poorly in some circumstances, so often we are faced with identifying the circumstances in which any proposed test might possibly be a good choice.

Description of the test

Like any hypothesis test, this one consists of (a) a null and alternate hypothesis and (b) a test statistic (the correlation coefficient) intended to discriminate between the hypotheses.

The null hypothesis is that the two variables come from the same distribution. To be precise, let us name the variables X and Y and assume we have observed n_x instances of X, written x = (x_1, x_2, \ldots, x_{n_x}), and n_y instances of Y, written y = (y_1, y_2, \ldots, y_{n_y}). The null hypothesis is that all instances of X and Y are independent and identically distributed (iid).

Let us take as the alternate hypothesis that (a) all instances of X are iid according to some underlying distribution F_X and (b) all instances of Y are iid according to some underlying distribution F_Y, but (c) F_X differs from F_Y. (Thus, we will not be looking for correlations among the x_i, correlations among the y_i, correlations between the x_i and y_j, or differences of distribution among the x’s or y’s separately: that is assumed not to be plausible.)

The proposed test statistic assumes that n_x = n_y (call this common value n) and computes the correlation coefficient of the pairs (x_{[i]}, y_{[i]}), where, as usual, x_{[i]} designates the i-th smallest of the x’s. Call this statistic t(x, y).

Permutation tests

In this situation, no matter what statistic t is proposed, we can always conduct a permutation test. Under the null hypothesis, the likelihood of the data ((x_1, x_2, \ldots, x_n), (y_1, y_2, \ldots, y_n)) is the same as the likelihood of any permutation of the 2n data values. In other words, the assignment of half the data to X and the other half to Y is a pure random coincidence. This is a simple, direct consequence of the iid assumptions and the null hypothesis that F_X = F_Y.

Therefore, the sampling distribution of t(x, y), conditional on the observations x_i and y_i, is the distribution of all the values of t attained for all (2n)! permutations of the data. We are interested in this because, for any given intended test size α, such as α = .05 (corresponding to 95% confidence), we will construct a one-sided critical region from the sampling distribution of t: it consists of the most extreme 100α% of the possible values of t (on the high side, because high correlation is consistent with similar distributions and low correlation is not). This is how we go about determining how large the correlation coefficient must be in order to decide the data come from different distributions.
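Concretely, once we have a sample from this permutation distribution, the critical value is just an upper quantile of it. Here is a minimal sketch; sim and t0 are stand-ins for the objects the code below actually computes:

alpha <- 0.05
sim <- rnorm(1000)                # stand-in for permutation values of t
t0 <- 2.5                         # stand-in for the observed statistic
crit <- quantile(sim, 1 - alpha)  # cutoff for the most extreme 100*alpha%
t0 >= crit                        # TRUE means: reject at size alpha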

Simulating the null sampling distribution

Because (2n)! (or, if you like, \binom{2n}{n}/2, which counts the number of ways of splitting the 2n data values into two unlabeled pieces of size n) gets big even for small n, it is not practicable to compute the sampling distribution exactly, so we sample it using a simulation. (For instance, when n = 16, \binom{2n}{n}/2 = 300,540,195 and (2n)! \approx 2.63\times 10^{35}.) About a thousand samples often suffice (and certainly will for the explorations we are about to undertake).
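These counts are easy to verify directly in R:

choose(32, 16) / 2   # 300,540,195 ways to split 32 values into two unlabeled halves
factorial(32)        # about 2.63e35 permutations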

There are two things we need to find out: first, what the sampling distribution looks like under the null hypothesis; second, how well this test discriminates between different distributions.

There is a complication: the sampling distribution depends on the nature of the data. All we can do is to look at realistic data, created to emulate whatever it is we are interested in studying, and hope that what we learn from the simulations will apply to our own situation.

Implementation

To illustrate, I have carried out this work in R. It falls naturally into three pieces.

  1. A function to compute the test statistic t(x,y). Because I want to be a little more general, my version handles datasets of different sizes (n_x \ne n_y) by linearly interpolating among the values in the (sorted) larger dataset to create matches with the (sorted) smaller dataset. Because this is already done by the R function qqplot, I just take its results:

    test.statistic <- function(x, y) {
      # Re-express the correlation so its null distribution is roughly symmetric
      transform <- function(z) -log(1-z^2)/2
      # qqplot sorts both datasets and interpolates the larger to match the smaller
      fit <- qqplot(x,y, plot.it=FALSE)
      transform(cor(fit$x, fit$y))
    }
    

    A little twist–unnecessary but helpful for visualization–re-expresses the correlation coefficient in a way that will make the distribution of the null statistic approximately symmetric. That’s what transform is doing.

  2. The simulation of the sampling distribution. For input, this function accepts the number of iterations n.iter along with the two sets of data in arrays x and y. It outputs an array of n.iter values of the test statistic. Its inner workings should be transparent, even to a non-R user:

    permutation.test <- function(n.iter, x, y) {
      z <- c(x,y)                  # pool the two datasets
      n.x <- length(x)
      n.y <- length(y)
      n <- length(z)
      k <- min(n.x, n.y)
      divide <- function() {
        # Randomly split the pooled data into pieces of sizes k and n-k
        i <- sample.int(n, size=k)
        test.statistic(z[i], z[-i])
      }
      replicate(n.iter, divide())  # test statistics for n.iter random splits
    }
    
  3. Although that’s all we need to conduct the test, in order to study it we will want to repeat the test many times. So, we conduct the test once and wrap that code within a third functional layer, just generally named f here, which we can call repeatedly. To make it sufficiently general for a broad study, for input it accepts the sizes of the datasets to simulate (n.x and n.y), the number of iterations for each permutation test (n.iter), a reference to the function test to compute the test statistic (you will see momentarily why we might not want to hard-code this), and two functions to generate iid random values, one for X (dist.x) and one for Y (dist.y). An option plot.it is useful to help see what’s going on.

    f <- function(n.x, n.y, n.iter, test=test.statistic, dist.x=runif, dist.y=runif, 
        plot.it=FALSE) {
      x <- dist.x(n.x)             # simulate one dataset from each distribution
      y <- dist.y(n.y)
      if(plot.it) qqplot(x,y)
    
      t0 <- test(x,y)              # statistic for the actual X/Y split
      sim <- permutation.test(n.iter, x, y)
      # Proportion of permutation values beyond t0, counting ties as half
      p <- mean(sim > t0) + mean(sim==t0)/2
      if(plot.it) {
        hist(sim, xlim=c(min(t0, min(sim)), max(t0, max(sim))), 
             main="Permutation distribution")
        abline(v=t0, col="Red", lwd=2)   # mark the observed statistic
      }
      return(p)
    }
    

    The output is a simulated “p-value”: the proportion of simulations yielding a statistic that looks more extreme than the one actually computed for the data.
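For example, a single application of the test, with arbitrary illustrative sizes and distributions, is just one call:

set.seed(1)                              # illustrative seed
f(16, 16, n.iter=1000, dist.y=rnorm)     # uniform x versus normal y
# Small values suggest the samples come from different distributions.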

Parts (2) and (3) are extremely general: you can conduct a study like this one for a different test simply by replacing test.statistic with some other calculation. We do that below.

First results

By default, our code compares data drawn from two uniform distributions. I let it do that (for n.x = n.y = 16, which are fairly small datasets and therefore present a moderately difficult test case) and then repeat it for a uniform-normal comparison and a uniform-exponential comparison. (Uniform distributions are not easy to distinguish from normal distributions unless you have a bit more than 16 values, but exponential distributions–having high skewness and a long right tail–are usually easily distinguished from uniform distributions.)

set.seed(17)             # Makes the results reproducible
n.per.rep <- 1000        # Number of iterations to compute each p-value
n.reps <- 1000           # Number of times to call `f`
n.x <- 16; n.y <- 16     # Dataset sizes

par(mfcol=c(2,3))        # Lay results out in three columns
null <- replicate(n.reps, f(n.x, n.y, n.per.rep))
hist(null, breaks=20)
plot(null)

normal <- replicate(n.reps, f(n.x, n.y, n.per.rep, dist.y=rnorm))
hist(normal, breaks=20)
plot(normal)

exponential <- replicate(n.reps, f(n.x, n.y, n.per.rep, dist.y=function(n) rgamma(n, 1)))
hist(exponential, breaks=20)
plot(exponential)

Correlation test results

On the left is the null distribution of the p-values when both X and Y are uniform. We would hope that the histogram is close to uniform (paying especial attention to the extreme left end, which is in the range of “significant” results)–and it actually is–and that the sequence of values obtained during the simulation, shown below it, looks random–and it does. That’s good. It means we can move on to the next step to study how this changes when X and Y come from different distributions.

The middle plots test 16 uniform variates x_i against 16 normal variates y_i. More often than not, the p-values were lower than expected. That indicates a tendency for this test actually to detect a difference. But it’s not a large one. For instance, the leftmost bar in the histogram shows that out of the 1000 runs of f (comprising 1000 separately simulated datasets), the p-value was less than 0.05 only about 110 times. If we consider that “significant,” then this test has only about an 11% chance of detecting the difference between a uniform and normal distribution based on 16 independent values from each. That’s pretty low power. But maybe it’s unavoidable, so let’s proceed.
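These rates can be read directly off the simulated p-values computed above:

mean(null < 0.05)     # close to 0.05, as it should be under the null
mean(normal < 0.05)   # about 0.11 here: the estimated power against a normal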

The right-hand plots similarly test a uniform distribution against an exponential one. This result is bizarre: more often than not, the test concludes that uniform data and exponential data look the same. It seems to “think” that uniform and exponential variates are more alike than two batches of uniform variates are! What’s going on here?

The problem is that data from an exponential distribution tend to include a few extremely high values. When you make a scatterplot of those against uniformly distributed values, there will be a few points far to the upper right of all the rest, and such high-leverage points produce a very high correlation coefficient. Moreover, under permutation the extreme values land in both halves and the sorted pairing matches the largest against the largest, so the permuted splits tend to produce comparably high correlations and the observed statistic no longer stands out. Thus, whenever either of the distributions generates a few extreme values, the correlation coefficient is a terrible choice for measuring how different the distributions are. This leads to another, even worse problem: as the dataset sizes grow, the chance of obtaining a few extreme observations increases. Thus, we can expect this test to perform worse and worse as the amount of data increases. How very awful…
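To see the leverage effect directly, draw one uniform and one exponential sample, sort them, and inspect their correlation; the seed and sizes below are arbitrary:

set.seed(2)               # illustrative
x <- sort(runif(16))
y <- sort(rexp(16))
plot(x, y)                # a few points sit far to the upper right
cor(x, y)                 # typically large despite the very different shapes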

A better test

The original question has been answered in the negative. However, there is a well-known, powerful test for discriminating among distributions: the Kolmogorov-Smirnov test. Instead of the correlation coefficient, it computes the largest vertical deviation from the line y=x in their QQ plot. (When data come from the same distribution, the QQ plot tends to follow this line. Otherwise, it will deviate somewhere; the K-S statistic picks up the largest such deviation.)
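Equivalently, the two-sample K-S statistic is the largest vertical gap between the two empirical CDFs, which makes it easy to compute by hand. This sketch should agree with R’s built-in version (used next) when there are no ties:

ks.stat <- function(x, y) {
  z <- sort(c(x, y))                   # the ECDFs can change only at these points
  max(abs(ecdf(x)(z) - ecdf(y)(z)))    # the largest vertical gap
}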

In practice, though, we can simply call R’s built-in function:

test.statistic <- function(x, y) {
  ks.test(x,y)$statistic   # the two-sample K-S statistic D
}

That’s right: it’s built into the software, so we only have to call it. But wait! If you read the manual carefully, you will learn that (a) the test supplies a p-value but (b) that p-value is (grossly) incorrect when both x and y are datasets. It is intended for use when you believe you know exactly what distribution the data x came from and you want to see whether that’s true. Thus the test does not properly accommodate the uncertainty about the distribution the y data came from.

No problem! The permutation test framework is still just as valid. By making the preceding change to test.statistic, all we have to do is re-run the previous study, unchanged. Here are the results.

K-S test study

Although the null distribution is not uniform (upper left), it’s pretty uniform below p=0.20 or so, which is where we really care about its values. A glance at the plot below it (bottom left) shows the problem: the K-S statistic tends to cluster around a few discrete values. (This problem practically goes away for larger datasets.)

The middle (uniform vs normal) and right (uniform vs exponential) histograms are doing exactly the right thing: in the vast majority of cases where the two distributions differ, this test produces small p-values. For instance, it has a 70% chance of yielding a p-value less than 0.05 when comparing a uniform to a normal distribution based on 16 values from each. Compare this to the piddling 11% achieved by the correlation coefficient test.

The right histogram is not quite as good, but at least it’s in the correct direction now! We estimate that it has a 30% chance of detecting the difference between a uniform and exponential distribution at the \alpha=5% level and a 50% chance of making that detection at the \alpha=10% level (because the two bars for the p-value less than 0.10 total over 500 of the 1000 iterations).
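Again, these estimates come straight from the rerun simulation objects:

mean(exponential < 0.05)   # about 0.30: power at the 5% level
mean(exponential < 0.10)   # about 0.50: power at the 10% level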

Conclusions

Thus, the problems with the correlation test are not due to some inherent difficulty in this setting. Not only does the correlation test perform very badly, it is bad compared to a widely known and available test. (I would guess that it is inadmissible, meaning that it will always perform worse, on the average, than the permutation version of the K-S test, implying there is no reason ever to use it.)

Attribution
Source: Link, Question Author: PascalVKooten, Answer Author: whuber
