$\begingroup$

Suppose one million random 10-digit binary numbers have been generated:

$$1011010101$$ $$1110010100$$ $$0110001010$$ $$...$$

A "bug" in the generator code is such that the $3^{rd}$ and $7^{th}$ digits of each number have an "extra" conditional algorithm tucked in:

If the parity of all digits to the left of the $3^{rd}$ digit is odd, and the parity of all digits to the right of the $7^{th}$ digit is even, then there is a 10% chance of the $7^{th}$ digit morphing to match the $3^{rd}$ digit value.

$$01{\color{red}0}010{\color{red}1}110$$
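Read literally (taking "parity of all digits" to mean the XOR of those digits), the bug could be sketched as follows. This is a hedged Python reconstruction, not the actual generator; `buggy_generator` and `p_morph` are made-up names.

```python
import random

def buggy_generator(p_morph=0.1):
    """One 10-bit string from a generator with the described bug.

    Assumes "parity of all digits to the left/right" means their XOR:
    odd parity = XOR is 1, even parity = XOR is 0.
    """
    bits = [random.randint(0, 1) for _ in range(10)]
    left_odd = (bits[0] ^ bits[1]) == 1              # digits 1-2, left of the 3rd
    right_even = (bits[7] ^ bits[8] ^ bits[9]) == 0  # digits 8-10, right of the 7th
    if left_odd and right_even and random.random() < p_morph:
        bits[6] = bits[2]  # 10% chance: the 7th digit morphs to match the 3rd
    return "".join(map(str, bits))
```

A million calls to `buggy_generator()` would reproduce a data set with this flavour of flaw.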

Suppose this bug is unknown to us.

Is there a systematic statistical way to discover the inherent non-randomness in this set of 1 million numbers, and if so, is there a way to discover which digits are problematic and the exact bug algorithm that produces them?
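For the narrower family of bugs where one bit position leaks into another, one standard probe is to z-score how often each pair of positions agrees. This is only a sketch of one such test, not a universal randomness detector, and `pairwise_agreement_scores` is a hypothetical helper name:

```python
from itertools import combinations
from math import sqrt

def pairwise_agreement_scores(strings):
    """z-score, for each pair of bit positions, of how often they agree.

    Under a fair generator any two positions agree 50% of the time, so the
    agreement count is Binomial(n, 1/2) with sd sqrt(n)/2; a large |z|
    flags a dependent pair (here it would point at positions 3 and 7).
    """
    n = len(strings)
    scores = {}
    for i, j in combinations(range(10), 2):
        agree = sum(1 for s in strings if s[i] == s[j])
        scores[(i + 1, j + 1)] = (agree - n / 2) / (sqrt(n) / 2)  # 1-based keys
    return scores
```

Even if such a test flags the pair $(3,7)$, recovering the exact conditional rule (the parity gating) would still require further conditioning on the other bits.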

$\endgroup$
  • $\begingroup$ If you have no suspicions regarding that particular bug, it seems awfully hard to catch. $\endgroup$ Commented Apr 20, 2023 at 13:32
  • $\begingroup$ Systematic way that tries to catch any non-randomness? I don't think so. There are just so extremely many ways the system can be non-random, you can't hope to cover them all, and each test has to be tailor-made to the kind of non-randomness it tries to detect. Systematic way that tries to catch non-randomness of the kind "One of the ten bits is equal to one of the other ten bits slightly more often than $50\%$ of the time", sure. Given enough samples, you could find a way to systematically detect those with decent probability. $\endgroup$ Commented Apr 20, 2023 at 13:36
  • $\begingroup$ @Arthur yes, we can find the $3^{rd}$ and $7^{th}$ digits using occurrence correlation. Once we find the anomaly, will there be a way to discover what algorithm causes this manifestation of the anomaly? $\endgroup$ Commented Apr 20, 2023 at 13:47
  • $\begingroup$ But are you seriously going to compute every possible pair correlation? What about triples? Groups of $91$? It certainly makes sense to run your "random" numbers through a suite of tests, but it will always be easy to find bugs that would elude any given suite of detectors. $\endgroup$ Commented Apr 20, 2023 at 13:58
  • $\begingroup$ Engineers have developed the list of Diehard tests that good random number generators should pass. @lulu How did that cleaning person get involved in the computation of $\pi\,?$ $\endgroup$ Commented Apr 20, 2023 at 14:11

1 Answer

$\begingroup$

Most of your $1024$ strings will be expected to appear about $\frac{10^6}{2^{10}} \approx 977$ times (with a standard deviation of about $31$), but for some the expected number is $10\%$ lower, about $879$, and for others $10\%$ higher, about $1074$.
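These figures follow directly from a Binomial$(10^6, 1/1024)$ count per string; a quick check of the arithmetic (a Python sketch, independent of the R code further down):

```python
from math import sqrt

n, k = 10**6, 2**10                         # one million draws over 1024 strings
expected = n / k                            # mean appearances per string
sd = sqrt(n * (1 / k) * (1 - 1 / k))        # Binomial(n, 1/k) standard deviation
low, high = 0.9 * expected, 1.1 * expected  # 10% depressed / inflated strings
print(round(expected), round(sd), round(low), round(high))  # 977 31 879 1074
```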

So if you did something like a chi-squared analysis then it would not surprise me if you were quite likely in this example to reject a hypothesis that all $1024$ strings/values were equally likely to appear.

If you then looked at the least and most common strings/values, you might spot that a disproportionate number of both were of the binary form $11xxxxx000$, i.e. in decimal at least $768$ and a multiple of $8$. It might take longer to work out the finer detail of the bug.

Here is an example in R:

    set.seed(2023)
    sims <- sample(0:(2^10-1), 10^6, replace=TRUE)
    unirv <- runif(10^6)
    simsadjusted <- sims + ifelse(unirv > 0.1, 0,
                      ifelse(sims %/% (2^7) == 7 & sims %% (2^4) == 0,  8, 0) +
                      ifelse(sims %/% (2^7) == 6 & sims %% (2^4) == 8, -8, 0))

    counts <- table(sims)
    chisq.test(counts) # before adjustment
    #         Chi-squared test for given probabilities
    #
    # data:  counts
    # X-squared = 1028.4, df = 1023, p-value = 0.4465

    countsadjusted <- table(simsadjusted)
    chisq.test(countsadjusted) # after adjustment
    #         Chi-squared test for given probabilities
    #
    # data:  countsadjusted
    # X-squared = 1351.1, df = 1023, p-value = 1.898e-11

    sortcountsadjusted <- sort(countsadjusted)
    names(head(sortcountsadjusted, 10)) # ten least common
    # "872" "840" "792" "896" "776" "1008" "856" "976" "912" "808"
    names(tail(sortcountsadjusted, 10)) # ten most common
    # "296" "832" "334" "984" "1016" "936" "920" "968" "904" "864"
$\endgroup$
  • $\begingroup$ (+1) In my opinion, a sample of $\ 10^6\ $ would be large enough to spot also that the higher and lower frequency numbers were the $256$ with $\ b_1\text{ xor }b_2=0\ $ and $\ b_8\text{ xor }b_9\text{ xor }b_{10}=1\ $ (where $\ b_i\ $ is the $\ i^\text{th}\ $ bit of the number), that the higher frequency numbers were the $128$ of these with $\ b_3\text{ xor }$$\,b_7=0\ $ and that the lower frequency ones were the $128$ of them with $\ b_3\text{ xor }$$\,b_7=1\ $, not, of course, with a very high degree of certainty, but at least as a good working hypothesis. $\endgroup$ Commented Apr 20, 2023 at 20:01
  • $\begingroup$ @lonzaleggiera: I think the tests apply to "all the digits" left and right so are $b_1\text{ and }b_2=1$ with $b_8\text{ or }b_9\text{ or }b_{10}=0$, so rather tighter than your suggestion. A sample of $10^6$ is very likely to show something is going on, but note that in my particular example $334_{10}=0101001110_2$ was the $8$th most common value through sampling noise, so I did not get perfect separation. $\endgroup$ Commented Apr 21, 2023 at 13:57
  • $\begingroup$ Rerunning my code on a different system annoyingly gives me different results despite the seed (above was run on rdrr.io) but the $11$th most common value using my PC was $252_{10}=0011111100_2$ so illustrating the same issue. $\endgroup$ Commented Apr 21, 2023 at 14:00
  • $\begingroup$ I wouldn't expect perfect separation with a sample of a million. But the frequencies of the higher and lower frequency numbers should differ from $\ \frac{10^6}{1024}\ $ by about $\ 3\ $ standard deviations, so I would expect it to be possible to identify the pattern. I do recognise, however, that when you've been told the pattern, armchair speculation about what you might be able to do is much easier than doing it in practice. $\endgroup$ Commented Apr 22, 2023 at 12:39
  • $\begingroup$ I didn't recognise the ambiguity in the description of the process until you pointed it out, although I did think it was a rather strange way of saying what I took it to mean. To me, however, it seems like an equally strange way of saying what you took it to mean, so I definitely think the meaning needs to be clarified. $\endgroup$ Commented Apr 22, 2023 at 12:44
