Check if a character string is not random

Background
Let’s say we have an alphabet of A, B, C, D. We look through some data and find a “word” such as DDDDDDDDCDDDDDD. The chance of finding this at random seems low to me, whereas BABDCABCDACDBACD seems more random.

Question
How should I check whether the strings I encounter are not random?

I tried some things in R, e.g., encoding the letters numerically and then comparing these to permutations. But encoding beforehand is quite cumbersome. Is there a more direct approach?

Answer

“The chance of finding this at random seems low to me, whereas BABDCABCDACDBACD seems more random.”

Why would that be? If each of the letters A…D has probability 0.25 and the letters are independent of one another, then any two words of the same length are exactly equally probable: each specific word of length n has probability 0.25^n. If the letter probabilities are not all equal, then of course the two words may have different probabilities.
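To make this concrete, here is a minimal Python sketch, assuming letters are drawn independently. The function name `word_probability` and the skewed distribution are hypothetical, chosen only for illustration. The two example words are trimmed to a common length so that they are compared like for like.

```python
from math import prod

def word_probability(word, letter_probs):
    """Probability of drawing exactly `word`, letter by letter, when each
    letter is sampled independently with the given probabilities."""
    return prod(letter_probs[letter] for letter in word)

# Equal letter probabilities: every letter has probability 0.25.
uniform = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
# Hypothetical skewed distribution, assumed here only for illustration.
skewed = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.7}

w1 = "DDDDDDDDCDDDDDD"
w2 = "BABDCABCDACDBACD"
# Trim both words to the same length before comparing them.
n = min(len(w1), len(w2))
w1, w2 = w1[:n], w2[:n]

for name, probs in [("uniform", uniform), ("skewed", skewed)]:
    print(name, word_probability(w1, probs), word_probability(w2, probs))
```

Under the uniform model both words get the same probability. Under the skewed model the D-rich word is far more probable than the mixed one, so which word looks "surprising" depends entirely on the assumed letter distribution.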

You can try to find “low complexity” words, for example words with an especially high proportion of one letter. You could use the Shannon information, as suggested in the other response, and biological sequence analysis offers many other approaches. But there is no test for “randomness” as such: without further assumptions or knowledge about what you are actually analyzing, the term “randomness” makes no sense.
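One way to score low complexity is the Shannon entropy of a word's letter frequencies. The following is a minimal Python sketch of that idea, not a test for randomness; the function name `shannon_entropy` is a hypothetical choice, and any cutoff for calling a word "low complexity" would be an assumption you have to justify for your data.

```python
from collections import Counter
from math import log2

def shannon_entropy(word):
    """Shannon entropy, in bits per letter, of the empirical letter
    frequencies of `word`. Returns 0 for an empty word."""
    n = len(word)
    counts = Counter(word)
    # sum of p * log2(1 / p) over the observed letters, with p = count / n
    return sum((c / n) * log2(n / c) for c in counts.values())

for word in ["DDDDDDDDCDDDDDD", "BABDCABCDACDBACD", "ABABABAB"]:
    print(word, shannon_entropy(word))
```

For a four-letter alphabet the score ranges from 0 bits (a single repeated letter) to 2 bits (all four letters equally frequent). Note that it only looks at letter composition and ignores order: ABABABAB scores 1 bit, the maximum for two letters, even though it is strongly patterned.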

Attribution
Source: Link, Question Author: CodeNoob, Answer Author: OrangeDog
