Goodness-of-fit for very large sample sizes

I collect very large samples (>1,000,000) of categorical data each day and want to check whether the data look “significantly” different between days, in order to detect errors in data collection.

I thought a goodness-of-fit test (in particular, a G-test) would be a good fit (pun intended) for this. The expected distribution is given by the previous day’s distribution.

But because my sample sizes are so large, the test has very high power and produces many false positives. That is to say, even a very minor daily fluctuation yields a near-zero p-value.

I ended up multiplying my test statistic by a constant (0.001), which has the nice interpretation of sampling the data at that rate. This article seems to agree with this approach. It says:

Chi square is most reliable with samples of between roughly 100 to 2500 people
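A minimal sketch of the scaling trick described above, assuming NumPy/SciPy; the counts, the 0.001 rate, and the function name are illustrative, not part of the original question:

```python
import numpy as np
from scipy.stats import chi2

def g_test_scaled(observed, expected_probs, rate=0.001):
    """G statistic for observed counts vs. expected proportions,
    multiplied by `rate` before computing the p-value — equivalent
    to pretending the data were sampled at that rate.
    Assumes all observed counts are positive."""
    observed = np.asarray(observed, dtype=float)
    expected = expected_probs * observed.sum()
    g = 2.0 * np.sum(observed * np.log(observed / expected))
    g_scaled = g * rate
    dof = len(observed) - 1
    return g_scaled, chi2.sf(g_scaled, dof)

# Hypothetical counts: yesterday defines the expected proportions,
# today shows only a tiny fluctuation.
yesterday = np.array([400_000, 350_000, 250_000], dtype=float)
today = np.array([401_500, 349_000, 249_500], dtype=float)
g, p = g_test_scaled(today, yesterday / yesterday.sum())
```

With these numbers the unscaled G is large enough to be “significant” at n = 1,000,000, while the scaled version is not — which is exactly the effect the question describes.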

I’m looking for more authoritative comments on this, or perhaps some alternative solutions to false positives when running statistical tests on large data sets.


The test is returning the correct result: the distributions are not the same from day to day. This is, of course, of no use to you. The issue you are facing has long been known. See: Karl Pearson and R. A. Fisher on Statistical Tests: A 1935 Exchange from Nature

Instead, you could look back at previous data (either yours or from somewhere else) and estimate the distribution of day-to-day changes for each category. Then you check whether the current change is likely to have occurred under that distribution. It is difficult to be more specific without knowing the data and the types of errors, but this approach seems better suited to your problem.
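One way the suggestion above might be sketched, assuming NumPy; the history, the z-score rule, and the 3-sigma threshold are assumptions for illustration (the answer does not prescribe a specific test):

```python
import numpy as np

def flag_unusual_changes(history, today, z_threshold=3.0):
    """history: (days, categories) array of daily category proportions.
    Flags categories whose latest day-to-day change is far outside the
    historical distribution of changes (simple z-score rule)."""
    changes = np.diff(history, axis=0)        # historical day-to-day deltas
    mu = changes.mean(axis=0)
    sigma = changes.std(axis=0, ddof=1)
    latest_change = today - history[-1]
    z = (latest_change - mu) / sigma
    return np.abs(z) > z_threshold

rng = np.random.default_rng(0)
# 30 days of proportions for 3 categories with small noise (hypothetical)
base = np.array([0.40, 0.35, 0.25])
history = base + rng.normal(0, 0.002, size=(30, 3))
today_ok = base + rng.normal(0, 0.002, size=3)
today_bad = base + np.array([0.05, -0.03, -0.02])  # a collection error
```

Here `flag_unusual_changes(history, today_bad)` flags the corrupted day, while a day with ordinary fluctuation passes — unlike the raw G-test, the threshold is calibrated to the changes you actually see, not to sampling noise at n = 1,000,000.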

Source: Link, Question Author: tskuzzy, Answer Author: Flask
