# Statistical hypothesis testing for empirical measurement

I’ve done a practical experiment based on a question I posted a while ago. The goal is to distinguish which of two sequences of 0s and 1s was generated by a true random generator (e.g. a coin flip) and which was generated by a human trying to imitate random behavior. This topic is discussed in the video about the frequency stability property. Basically, the video says that instead of counting the occurrences of 0 and 1 separately, the distinction between “true” random and “human” random can be made by sliding a window of length 3 over both inputs and counting the occurrences of the sub-sequences that appear in this window. A “true” random generator should produce all sub-sequences in the window equally often, whereas a “human” random generator should not. Here is what I mean (the input is in decimal rather than binary for readability, but I hope you get the point; the Perl code used for generating the histograms with a sliding window is here):

INPUT:
1 2 3 4 5 6 7 8
WIN SIZE:3, STEP: 2
1 2 3
3 4 5
5 6 7

INPUT:
1 2 3 4 5 6 7 8
WIN SIZE:3, STEP: 1
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8


Datasets:

There are two datasets, both containing 1000 binary digits (0s and 1s):

The histograms for both datasets:

1st dataset (random.org):
0: 471
1: 529

2nd dataset (human):
0: 518
1: 482


Experiments:

1st Experiment: WINDOW = 3, STEP = 1
####################################
1st dataset (random.org):
000: 93
001: 118
010: 133
011: 126
100: 118
101: 142
110: 126
111: 142

2nd dataset (human):
000: 33
001: 98
010: 335
011: 51
100: 98
101: 289
110: 51
111: 43

2nd Experiment: WINDOW = 3, STEP = 2
####################################
1st dataset (random.org):
000: 47
001: 57
010: 67
011: 57
100: 60
101: 78
110: 55
111: 78

2nd dataset (human):
000: 17
001: 48
010: 151
011: 21
100: 49
101: 166
110: 21
111: 26

3rd Experiment: WINDOW = 3, STEP = 3
####################################
1st dataset (random.org):
000: 31
001: 37
010: 51
011: 41
100: 40
101: 46
110: 35
111: 52

2nd dataset (human):
000: 13
001: 31
010: 116
011: 21
100: 27
101: 95
110: 15
111: 15


Three questions here:

1. As you can see, the “true” random generator has a more uniform distribution than the “human” one. Is there a measure of this disproportion, or a threshold that can reliably distinguish between the two histograms?
2. Is this concrete example sufficient to distinguish between the “true” and human generator, or are there other methods?
3. Can I construct a statistical hypothesis about this measure and test it? (It would be nice if somebody could show how to do this step by step, because I’m a novice at this.)

PS:
In my original question I got the answer that the windows should not overlap. But my understanding is that if a string is truly random, then the sub-sequences of any window size should be equally likely for any given step size. Also, the examples I’ve posted here with overlapping windows (1st and 2nd experiments) seem to show that overlap has no impact on the results. I asked about this and it was not answered, which is the main reason I haven’t accepted the answer yet.

Consider using a chi-squared goodness-of-fit test. It would be best to use step=3 so the observations are independent. The test is explained on Wikipedia: the observed values are the counts you provided, and the expected count in each of the 8 buckets is the total count divided by 8. I’ve provided R code and results for step=3. Note that R’s chisq.test takes the probabilities of each bucket, not the expected counts.

obs <- c(13, 31, 116, 21, 27, 95, 15, 15)          # human dataset, step=3
expected_prob <- rep(1 / length(obs), length(obs))  # uniform: 1/8 per bucket
chisq.test(obs, p=expected_prob)
obs2 <- c(31, 37, 51, 41, 40, 46, 35, 52)           # random.org dataset, step=3
chisq.test(obs2, p=expected_prob)


The random.org dataset has p-value $= 0.22$ and the human dataset has p-value $< 2.2 \times 10^{-16}$. For this test, the null hypothesis $H_0$ is that the counts are all equal, i.e. the sequence was generated by a random number generator; $H_a$ is that the counts are not all equal, i.e. it was generated by a human. A low p-value (usually less than 0.05) is evidence against $H_0$, and it would be rejected. However, there is no perfect random number generator: as the string gets longer and longer, the test becomes more sensitive, and even a good generator can eventually fail it.
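For reference, the chi-squared statistic itself is easy to compute by hand; here is a minimal Python sketch (the p-value then comes from the chi-squared distribution with 7 degrees of freedom, which R’s chisq.test supplies):

```python
def chi_squared_stat(observed):
    """Chi-squared goodness-of-fit statistic against a uniform
    expectation (total count split evenly over the buckets)."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

human_counts  = [13, 31, 116, 21, 27, 95, 15, 15]  # step=3, human dataset
random_counts = [31, 37, 51, 41, 40, 46, 35, 52]   # step=3, random.org dataset

print(chi_squared_stat(human_counts))   # ~273, far in the tail of chi2(7)
print(chi_squared_stat(random_counts))  # ~9.5, consistent with uniform
```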

There is a lot written about creating random number generators and testing how good they are. Since your string is plain binary, a common test is the runs test. A run is a maximal block of consecutive identical symbols, so 011 is two runs. The limiting distribution of the number of runs in a sequence is known, and a z-test can be performed. I think this test may have the advantage of working better with longer strings.
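A minimal sketch of that runs test (the Wald-Wolfowitz test) under the usual normal approximation; the mean and variance formulas for the run count are the standard ones, but treat this as an illustration rather than the answerer’s code:

```python
import math
from itertools import groupby

def runs_test_z(bits):
    """Wald-Wolfowitz runs test: z-score for the number of runs
    (maximal blocks of identical symbols) in a binary string."""
    n1 = bits.count("1")
    n0 = bits.count("0")
    n = n0 + n1
    runs = sum(1 for _ in groupby(bits))  # observed number of runs
    mu = 2 * n0 * n1 / n + 1              # expected runs under H0
    var = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n ** 2 * (n - 1))
    return (runs - mu) / math.sqrt(var)

# A strictly alternating string has too many runs: z is large
# and positive (about 2.68 here), so H0 would be rejected.
print(runs_test_z("0101010101"))
```

A human who avoids repetition (as the 010/101 spikes above suggest) produces too many runs; a human who overuses repetition produces too few. Both push |z| away from zero.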

Whether a human can beat either of these two tests is up to you. Personally, I think it is unlikely unless they know which tests are being run.