# Impact of data-based bin boundaries on a chi-square goodness of fit test?

Leaving aside the obvious issue of the low power of the chi-square in this sort of circumstance, imagine doing a chi-square goodness of fit test for some density with unspecified parameters, by binning the data.

For concreteness, let’s say an exponential distribution with unknown mean and a sample size of say 100.

In order to get a reasonable number of expected observations per bin some account would need to be taken of the data (e.g. if we chose to put 6 bins below the mean and 4 above it, that would still be using data-based bin boundaries).

But this use of bins based on seeing the data would presumably affect the distribution of the test statistic under the null.

I have seen plenty of discussion about the fact that – if the parameters are estimated by maximum likelihood from the binned data – you lose 1 d.f. per estimated parameter (an issue dating right back to Fisher vs Karl Pearson) – but I don’t recall reading anything about finding the bin boundaries themselves based on the data. (If you estimate the parameters from the unbinned data, then with $k$ bins the distribution of the test statistic lies somewhere between a $\chi^2_{k-1}$ and a $\chi^2_{k-1-p}$.)

Does this data-based choice of bins substantively impact significance level or power? Are there some approaches that matter more than others? If there is much of an effect, is it something that goes away in large samples?

If it does have a substantive impact, this would seem to make the use of a chi-squared test when parameters are unknown almost useless in many cases (in spite of still being advocated in quite a few texts), unless you had a good a priori estimate of the parameter.

Discussion of the issues or pointers to references (preferably with a mention of their conclusions) would be useful.

Edit, pretty much an aside to the main question:

It occurs to me that there are potential solutions for the specific case of the exponential* (and the uniform come to think of it), but I am still interested in the more general issue of the impact of choosing bin boundaries.

* For example, for the exponential, one might use the smallest observation (say it is equal to $m$) to get a very rough idea of where to place the bins (since the smallest observation is exponential with mean $\mu/n$), and then test the remaining $n-1$ differences ($x_i - m$) for exponentiality. Of course that might yield a very poor estimate of $\mu$, and hence poor bin choices, though I suppose one might apply the argument recursively, taking the lowest two or three observations from which to choose reasonable bins and then testing the differences of the remaining observations above the largest of those small order statistics for exponentiality.
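A rough sketch of that idea in code (all names, the sample size, and the bin count are illustrative choices of mine, not a recommended recipe):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 100, 8
x = rng.exponential(scale=3.0, size=n)

m = x.min()
mu_rough = n * m              # E[min] = mu/n, so n * min is a crude estimate of mu
diffs = np.sort(x)[1:] - m    # by memorylessness, these excesses are again iid exponential

# Equiprobable bins for an exponential with mean mu_rough, using
# F^{-1}(p) = -mu * log(1 - p).
q = -mu_rough * np.log(1.0 - np.arange(1, k) / k)
edges = np.concatenate(([0.0], q, [np.inf]))
counts = np.histogram(diffs, bins=edges)[0]
expected = (n - 1) / k        # nominal expected count per bin
x2 = np.sum((counts - expected) ** 2 / expected)
print(counts, x2)
```

Note that `mu_rough` has the variance of a single exponential observation, so, exactly as the aside warns, the bins can land in very poor places on any given sample.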

The basic results of chi-square goodness-of-fit testing can be understood hierarchically.

Level 0. The classical Pearson chi-square test statistic for testing a multinomial sample against a fixed probability vector $p$ is
$$X^2 = \sum_{i=1}^{k} \frac{\left(X_i^{(n)} - n p_i\right)^2}{n p_i},$$

where $X_i^{(n)}$ denotes the number of outcomes in the $i$th cell out of a sample of size $n$. This can be fruitfully viewed as the squared norm of the vector $\mathbf Y_n = (Y_1^{(n)},\ldots,Y_k^{(n)})$ where $Y_i^{(n)} = (X_i^{(n)} - n p_i)/\sqrt{n p_i}$, which, by the multivariate central limit theorem, converges in distribution as
$$\mathbf Y_n \xrightarrow{d} \mathcal N\left(0,\; \mathbf I - \sqrt{p}\,\sqrt{p}^T\right),$$
where $\sqrt{p} = (\sqrt{p_1},\ldots,\sqrt{p_k})^T$.

From this we see that $X^2 = \|\mathbf Y_n\|^2 \to \chi^2_{k-1}$ since $\mathbf I - \sqrt{p}\sqrt{p}^T$ is idempotent of rank $k-1$.
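To see this convergence numerically, here is a minimal simulation sketch (the cell count, sample size, and seed are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n, reps = 5, 500, 2000
p = np.full(k, 1.0 / k)                  # fixed null probability vector

x2_vals = np.empty(reps)
for r in range(reps):
    counts = rng.multinomial(n, p)       # multinomial sample under the null
    x2_vals[r] = np.sum((counts - n * p) ** 2 / (n * p))   # Pearson's X^2

# Under the null, X^2 behaves like chi^2_{k-1}: rejecting above the
# 0.95-quantile of chi^2_{k-1} should occur close to 5% of the time.
crit = stats.chi2.ppf(0.95, df=k - 1)
rej_rate = np.mean(x2_vals > crit)
print(rej_rate)
```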

Level 1. At the next level of the hierarchy, we consider composite hypotheses with multinomial samples. Since the exact $p$ of interest is unknown under the null hypothesis, we have to estimate it. If the null hypothesis is composite, consisting of a subspace of dimension $m$, then maximum likelihood estimates (or other efficient estimators) of the $p_i$ can be used as “plug-in” estimators. Then, the statistic
$$X^2 = \sum_{i=1}^{k} \frac{\left(X_i^{(n)} - n \hat{p}_i\right)^2}{n \hat{p}_i} \xrightarrow{d} \chi^2_{k-m-1}$$
under the null hypothesis.

Level 2. Consider goodness-of-fit testing of a parametric model where the cells are fixed and known in advance. For example, suppose we have a sample from an exponential distribution with rate $\lambda$ and from this we produce a multinomial sample by binning into $k$ cells. Then the above result still holds, provided that we use efficient estimates (e.g., MLEs) of the bin probabilities themselves, computed using only the observed cell frequencies.

If the number of parameters for the distribution is $m$ (e.g., $m = 1$ in the exponential case), then
$$X^2 = \sum_{i=1}^{k} \frac{\left(X_i^{(n)} - n \hat{p}_i\right)^2}{n \hat{p}_i} \xrightarrow{d} \chi^2_{k-m-1},$$
where here $\hat{p}_i$ can be taken to be the MLEs of the cell probabilities of the fixed, known cells corresponding to the given distribution of interest.

Level 3. But, wait! If we have a sample $Z_1,\ldots,Z_n \sim F_\lambda$, why shouldn’t we estimate $\lambda$ efficiently first, and then use a chi-square statistic with our fixed, known cells? Well, we can, but in general we no longer get a chi-square distribution for the corresponding chi-square statistic. In fact, Chernoff and Lehmann (1954) showed that using MLEs to estimate the parameters and then plugging them back in to get estimates of the cell probabilities results in a non-chi-square distribution, in general. Under suitable regularity conditions, the distribution is (stochastically) between a $\chi_{k-m-1}^2$ and a $\chi_{k-1}^2$ random variable, with the distribution depending on the parameters.

Intuitively, this means that the limiting distribution of $\mathbf Y_n$ is $\mathcal N(0, \mathbf I - \sqrt{p_\lambda}\sqrt{p_\lambda}^T - \mathbf A(\lambda))$, where the extra term $\mathbf A(\lambda)$ arises from estimating $\lambda$ efficiently from the raw (unbinned) data.
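A small simulation sketch of the Chernoff–Lehmann effect (the bin edges, sample size, and seed are illustrative assumptions of mine): plugging the raw-data MLE into fixed cells and referring $X^2$ to $\chi^2_{k-m-1}$ gives a rejection rate that tends to sit at or above the nominal level, since the true limit is stochastically between $\chi^2_{k-m-1}$ and $\chi^2_{k-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
edges = np.array([0.0, 0.5, 1.0, 2.0, 4.0, np.inf])   # k = 5 fixed cells
k, m, n, reps = 5, 1, 400, 1000

crit = stats.chi2.ppf(0.95, df=k - m - 1)
rej = 0
for _ in range(reps):
    x = rng.exponential(scale=1.0, size=n)     # true rate = 1
    lam_hat = 1.0 / x.mean()                   # efficient MLE from the raw data
    p_hat = np.diff(1.0 - np.exp(-lam_hat * edges))
    counts = np.histogram(x, bins=edges)[0]
    rej += np.sum((counts - n * p_hat) ** 2 / (n * p_hat)) > crit
rate = rej / reps
# Not guaranteed to equal 0.05: the limit is between chi^2_{k-m-1}
# and chi^2_{k-1}, so this rejection rate is typically inflated.
print(rate)
```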

We haven’t even talked about random cell boundaries yet, and we’re already in a bit of a tight spot! There are two ways out: One is to retreat back to Level 2, or at the very least not use efficient estimators (like MLEs) of the underlying parameters $\lambda$. The second approach is to try to undo the effects of $\mathbf A(\lambda)$ in such a way as to recover a chi-square distribution.

There are several ways of going the latter route. They basically amount to premultiplying $\mathbf Y_n$ by the “right” matrix $\mathbf B(\hat{\lambda})$. Then, the quadratic form
$$\mathbf Y_n^T\, \mathbf B(\hat{\lambda})\, \mathbf Y_n$$
again has a limiting chi-square distribution (of $\chi^2_{k-1}$ form for the Rao–Robson–Nikulin statistic), where $k$ is the number of cells.

Examples are the Rao–Robson–Nikulin statistic and the Dzhaparidze–Nikulin statistic.

Level 4. Random cells. In the case of random cells, under certain regularity conditions, we end up in the same situation as in Level 3 if we take the route of modifying the Pearson chi-square statistic. Location-scale families, in particular, behave very nicely. One common approach is to take our $k$ cells each to have probability $1/k$, nominally. So, our random cells are intervals of the form $\hat{I}_j = \hat \mu + \hat\sigma I_{0,j}$ where $I_{0,j} = [F^{-1}((j-1)/k), F^{-1}(j/k))$. This result has been further extended to the case where the number of random cells grows with the sample size.
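As an illustration of the random-cell construction (a sketch with illustrative names; for the exponential, a scale family, the random cells are just the standard-exponential equiprobable cells rescaled by the estimated mean):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
k, n = 5, 400
x = rng.exponential(scale=2.0, size=n)

mu_hat = x.mean()   # efficient scale estimate (MLE of the mean)
# Random cells: hat{I}_j = mu_hat * I_{0,j}, where the I_{0,j} are the
# equiprobable cells of the *standard* exponential.
q = stats.expon.ppf(np.arange(1, k) / k)                # interior quantiles
edges = np.concatenate(([0.0], mu_hat * q, [np.inf]))
counts = np.histogram(x, bins=edges)[0]

# Nominal cell probabilities are all 1/k. Note the plain Pearson statistic
# below is *not* asymptotically chi^2_{k-1} here (that is the point of
# Levels 3-4); this only illustrates how the random cells are built.
x2 = np.sum((counts - n / k) ** 2 / (n / k))
print(counts, x2)
```

A corrected statistic (Rao–Robson–Nikulin or Dzhaparidze–Nikulin) would then be applied to these counts to recover a chi-square limit.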

References

1. A. W. van der Vaart (1998), Asymptotic Statistics, Cambridge University Press. Chapter 17: Chi-Square Tests.

2. H. Chernoff and E. L. Lehmann (1954), The use of maximum likelihood estimates in $\chi^2$ tests for goodness of fit, Ann. Math. Statist., vol. 25, no. 3, 579–586.

3. F. C. Drost (1989), Generalized chi-square goodness-of-fit tests for location-scale models when the number of classes tends to infinity, Ann. Statist., vol. 17, no. 3, 1285–1300.

4. M. S. Nikulin (1973), Chi-square test for continuous distributions with shift and scale parameters, Theory of Probability and its Applications, vol. 19, no. 3, 559–568.

5. K. O. Dzhaparidze and M. S. Nikulin (1973), On a modification of the standard statistics of Pearson, Theory of Probability and its Applications, vol. 19, no. 4, 851–853.

6. K. C. Rao and D. S. Robson (1974), A chi-square statistic for goodness of fit tests within the exponential family, Comm. Statist., vol. 3, no. 12, 1139–1153.

7. N. Balakrishnan, V. Voinov and M. S. Nikulin (2013), Chi-Squared Goodness of Fit Tests With Applications, Academic Press.