$\begingroup$

Suppose one million random 10-digit binary numbers have been generated:

$$1011010101$$ $$1110010100$$ $$0110001010$$ $$...$$

A "bug" in the generator code is such that the $3^{rd}$ and $7^{th}$ digits of each number have an "extra" conditional algorithm tucked in:

If the parity of all digits to the left of the $3^{rd}$ digit is odd, and the parity of all digits to the right of the $7^{th}$ digit is even, then there is a 10% chance of the $7^{th}$ digit morphing to match the $3^{rd}$ digit value.

$$01{\color{red}0}010{\color{red}1}110$$
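Read literally (taking "parity of all digits" to mean the XOR of those digits), the bug could be sketched as follows. This is a hedged Python reconstruction, not the actual generator; `buggy_generator` and `p_morph` are made-up names.

```python
import random

def buggy_generator(p_morph=0.1):
    """One 10-bit string from a generator with the described bug.

    Assumes "parity of all digits to the left/right" means their XOR:
    odd parity = XOR is 1, even parity = XOR is 0.
    """
    bits = [random.randint(0, 1) for _ in range(10)]
    left_odd = (bits[0] ^ bits[1]) == 1              # digits 1-2, left of the 3rd
    right_even = (bits[7] ^ bits[8] ^ bits[9]) == 0  # digits 8-10, right of the 7th
    if left_odd and right_even and random.random() < p_morph:
        bits[6] = bits[2]  # 10% chance: the 7th digit morphs to match the 3rd
    return "".join(map(str, bits))
```

A million calls to `buggy_generator()` would reproduce a data set with this flavour of flaw.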

Suppose this bug is unknown to us.

Is there a systematic statistical way to discover the inherent non-randomness in this set of 1 million numbers, and if so, is there a way to discover which digits are problematic and the exact bug algorithm that produces them?
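For the narrower family of bugs where one bit position leaks into another, one standard probe is to z-score how often each pair of positions agrees. This is only a sketch of one such test, not a universal randomness detector, and `pairwise_agreement_scores` is a hypothetical helper name:

```python
from itertools import combinations
from math import sqrt

def pairwise_agreement_scores(strings):
    """z-score, for each pair of bit positions, of how often they agree.

    Under a fair generator any two positions agree 50% of the time, so the
    agreement count is Binomial(n, 1/2) with sd sqrt(n)/2; a large |z|
    flags a dependent pair (here it would point at positions 3 and 7).
    """
    n = len(strings)
    scores = {}
    for i, j in combinations(range(10), 2):
        agree = sum(1 for s in strings if s[i] == s[j])
        scores[(i + 1, j + 1)] = (agree - n / 2) / (sqrt(n) / 2)  # 1-based keys
    return scores
```

Even if such a test flags the pair $(3,7)$, recovering the exact conditional rule (the parity gating) would still require further conditioning on the other bits.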

$\endgroup$
  • $\begingroup$ If you have no suspicions regarding that particular bug, it seems awfully hard to catch. $\endgroup$ Commented Apr 20, 2023 at 13:32
  • $\begingroup$ Systematic way that tries to catch any non-randomness? I don't think so. There are just so extremely many ways the system can be non-random, you can't hope to cover them all, and each test has to be tailor-made to the kind of non-randomness it tries to detect. Systematic way that tries to catch non-randomness of the kind "One of the ten bits is equal to one of the other ten bits slightly more often than $50\%$ of the time", sure. Given enough samples, you could find a way to systematically detect those with decent probability. $\endgroup$ Commented Apr 20, 2023 at 13:36
  • $\begingroup$ @Arthur yes, we can find the $3^{rd}$ and $7^{th}$ digits using occurrence correlation. Once we find the anomaly, will there be a way to discover what algorithm causes this manifestation of the anomaly? $\endgroup$ Commented Apr 20, 2023 at 13:47
  • $\begingroup$ But are you seriously going to compute every possible pair correlation? What about triples? Groups of $91$? It certainly makes sense to run your "random" numbers through a suite of tests, but it will always be easy to find bugs that would elude any given suite of detectors. $\endgroup$ Commented Apr 20, 2023 at 13:58
  • $\begingroup$ Engineers have developed the list of Diehard tests that good random number generators should pass. @lulu How did that cleaning person get involved in the computation of $\pi\,?$ $\endgroup$ Commented Apr 20, 2023 at 14:11

1 Answer

$\begingroup$

Most of your $1024$ strings will be expected to appear about $\frac{10^6}{2^{10}} \approx 977$ times (with a standard deviation of about $31$), but for some the expected number is $10\%$ lower, about $879$, and for others $10\%$ higher, about $1074$.
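These figures follow directly from a Binomial$(10^6, 1/1024)$ count per string; a quick check of the arithmetic (a Python sketch, independent of the R code further down):

```python
from math import sqrt

n, k = 10**6, 2**10                         # one million draws over 1024 strings
expected = n / k                            # mean appearances per string
sd = sqrt(n * (1 / k) * (1 - 1 / k))        # Binomial(n, 1/k) standard deviation
low, high = 0.9 * expected, 1.1 * expected  # 10% depressed / inflated strings
print(round(expected), round(sd), round(low), round(high))  # 977 31 879 1074
```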

So if you did something like a chi-squared analysis then it would not surprise me if you were quite likely in this example to reject a hypothesis that all $1024$ strings/values were equally likely to appear.

If you then looked at the least and most common strings/values, you might spot that a disproportionate number of both were of the binary form $11xxxxx000$, i.e. in decimal at least $768$ and a multiple of $8$. It might take longer to work out the finer detail of the bug.

Here is an example in R:

    set.seed(2023)
    sims <- sample(0:(2^10-1), 10^6, replace=TRUE)
    unirv <- runif(10^6)
    simsadjusted <- sims + ifelse(unirv > 0.1, 0,
                      ifelse(sims %/% (2^7) == 7 & sims %% (2^4) == 0,  8, 0) +
                      ifelse(sims %/% (2^7) == 6 & sims %% (2^4) == 8, -8, 0))

    counts <- table(sims)
    chisq.test(counts) # before adjustment
    #         Chi-squared test for given probabilities
    #
    # data:  counts
    # X-squared = 1028.4, df = 1023, p-value = 0.4465

    countsadjusted <- table(simsadjusted)
    chisq.test(countsadjusted) # after adjustment
    #         Chi-squared test for given probabilities
    #
    # data:  countsadjusted
    # X-squared = 1351.1, df = 1023, p-value = 1.898e-11

    sortcountsadjusted <- sort(countsadjusted)
    names(head(sortcountsadjusted, 10)) # ten least common
    # "872" "840" "792" "896" "776" "1008" "856" "976" "912" "808"
    names(tail(sortcountsadjusted, 10)) # ten most common
    # "296" "832" "334" "984" "1016" "936" "920" "968" "904" "864"
$\endgroup$
  • $\begingroup$ (+1) In my opinion, a sample of $\ 10^6\ $ would be large enough to spot also that the higher and lower frequency numbers were the $256$ with $\ b_1\text{ xor }b_2=0\ $ and $\ b_8\text{ xor }b_9\text{ xor }b_{10}=1\ $ (where $\ b_i\ $ is the $\ i^\text{th}\ $ bit of the number), that the higher frequency numbers were the $128$ of these with $\ b_3\text{ xor }$$\,b_7=0\ $ and that the lower frequency ones were the $128$ of them with $\ b_3\text{ xor }$$\,b_7=1\ $, not, of course, with a very high degree of certainty, but at least as a good working hypothesis. $\endgroup$ Commented Apr 20, 2023 at 20:01
  • $\begingroup$ @lonzaleggiera: I think the tests apply to "all the digits" left and right so are $b_1\text{ and }b_2=1$ with $b_8\text{ or }b_9\text{ or }b_{10}=0$, so rather tighter than your suggestion. A sample of $10^6$ is very likely to show something is going on, but note that in my particular example $334_{10}=0101001110_2$ was the $8$th most common value through sampling noise, so I did not get perfect separation. $\endgroup$ Commented Apr 21, 2023 at 13:57
  • $\begingroup$ Rerunning my code on a different system annoyingly gives me different results despite the seed (above was run on rdrr.io) but the $11$th most common value using my PC was $252_{10}=0011111100_2$ so illustrating the same issue. $\endgroup$ Commented Apr 21, 2023 at 14:00
  • $\begingroup$ I wouldn't expect perfect separation with a sample of a million. But the frequencies of the higher and lower frequency numbers should differ from $\ \frac{10^6}{1024}\ $ by about $\ 3\ $ standard deviations, so I would expect it to be possible to identify the pattern. I do recognise, however, that when you've been told the pattern, armchair speculation about what you might be able to do is much easier than doing it in practice. $\endgroup$ Commented Apr 22, 2023 at 12:39
  • $\begingroup$ I didn't recognise the ambiguity in the description of the process until you pointed it out, although I did think it was a rather strange way of saying what I took it to mean. To me, however, it seems like an equally strange way of saying what you took it to mean, so I definitely think the meaning needs to be clarified. $\endgroup$ Commented Apr 22, 2023 at 12:44
