Minimum number of length of sublist so that they have one element in common

Question

I am aware that cross posting isn't typically allowed on SE but I think this question, earlier posted on math SE, is worth considering on crypto SE.

(Since this post is just a copy of math SE question I am modifying it to Cryoto SE)

I have two list, say list1 and list2, containing random integers and both having size $N$. We have a fixed overlap ratio between them. If overlap ratio is $0.3$ and $N = 10$ it means both list have around $3$ integers in common. However this doesn’t guarantee that the indices for these elements will be same. In most of the cases they will be different.

I want help regarding the minimum number of brute force searches I need to perform to find an element that occurs in both list.

Since we need to find a common element, which is guaranteed by the overlap ratio, We start by creating two list, say $A$ and $B$, and add element into them from list1 and list2 until we have: $|A \cap B| = 1$. That is both sub lists have at least one element in common.

The final goal: Minimum number of length of sublists (number of elements in them) so that both have at-least one element in common.

Here is how sublist are created: We pick a random index, $i$, in range $(0, N)$ and create two sub lists: sublist1 and sublist2.

Step 1: Send list1[i] element to sublist1

Step 2: Send list2[i] element to sublist2

Step 3: Check if sublist1 and sublist2 have common elements irrespective of their indices in sublist. Assume that this step is also very simple and completes instantly.

Step 4: If no common element is found increase $i$ with $1$ and repeat. If $i = N +1$ set $i = 1$ again.

I wrote a python program with overlap ratio $0.3$ and here are the results

N Avg Trials % of N √N Ratio (Trials/√N) 120 17 14.17% 11 1.55 12,000 177 1.47% 110 1.61 100,000 539 0.539% 316 1.70 1,000,000 1,584 0.1584% 1,000 1.58 10,000,000 4,975 0.04975% 3,162 1.57

$N$ denotes the length of list. Then I ran the code multiple times and after running it multiple times I calculated the average trials needed that is in Avg Trials. % of N indicates the percentage of range needed to search (average).

I am not good at probability. This suggests that it might be following the birthday bound. If someone could help me with more details and help generalise these results. I seek help for following:

Given $N = 10^{80}$ and overlap ratio $=0.3$ what will be the minimum number of trials that ensure there is atleast one element in both sublist?
What will be the cost of implementing this (as asked in 1) and whether it's feasible?
Whether this matching follows a birthday bound, if yes, then why?

Example:

List1 = $[52579, 46780, 36228, 40369, 26694, 36744, 9651, 98806, 5305, 64566]$

List2 = $[46780, 56910, 80182, 63250, 53981, 5305, 76771, 9651, 61251, 76094]$

Let $i = 2$

Trial 1

sublist1 = $[46780]$

sublist2 = $[56910]$

Since there is no matching element we include next elements

Thus $i = 2+1 = 3$.

sublist1 = $[46780, 36228]$

sublist2 = $[56910, 80182]$

Again there is no matching element so we keep including more additional element.

At 7th trial the picture looks like

sublist1 = $[46780, 36228, 40369, 26694, 36744, 9651, 98806]$

sublist2 = $[56910, 80182, 63250, 53981, 5305, 76771, 9651]$

Here we have $9651$ in sublist2 which occur in sublist1. Hence here the length is $7$.

@madhurkant I think its enough to be considered here. This is equally mathematic and cryptographic. — esteregg
– esteregg, Commented Nov 30, 2024 at 12:34
Assuming this is considered to be on-topic, which I’m not convinced it is, the actual question is still ambiguous. Are we analyzing the algorithm provided, or trying to find the best algorithm to solve the problem? — Max
– Max, Commented Nov 30, 2024 at 17:45
I am asking you to modify it so it meets standards here as pointed out by @poncho. Its okay to accept that you don't know instead of arguing. Your question need significant changes and I will be doing it only if you ensure to accept my edit because doing so consumes time. — madhurkant
– madhurkant, Commented Dec 1, 2024 at 2:57
Let it be assumed that storing list as large as 10^80 is possible — esteregg
– esteregg, Commented Dec 1, 2024 at 6:00

kodlu · Accepted Answer · 2024-12-01 19:46:55Z

Given two lists $A,B$ each with size $N$ and overlap ratio $\alpha \in (0,1)$ i.e., $$\#A \cap B = \alpha N,$$

consider the random sequence of elements $X_i,i=1,2,\ldots$ and $Y_j,j=1,2,\ldots$ which are drawn to form two lists $L_A$ and $L_B$ so that the lists are sequentially $$\{X_1\},\{X_1,X_2\},\ldots,\{X_1,X_2,\ldots,X_t\}$$ and $$\{Y_1\},\{Y_1,Y_2\},\ldots,\{Y_1,Y_2,\ldots,Y_t\}.$$ Let the lists at stage $t$ be denoted $$L_A(t)=\{X_1,X_2,\ldots,X_t\}$$ and $$L_B(t)=\{Y_1,Y_2,\ldots,Y_t\}.$$

It is straightforward to give an average case analysis which will hold in a typical attack with high probability given the huge size of the list that you are considering. It is similar to the birthday paradox analysis which you can find on Wikipedia:

In the sequence $X_1,X_2,\ldots,$ each element $X_i$ is in the intersection with probability $\alpha.$ Similarly in the sequence $X_1,X_2,\ldots,$ each element $Y_i$ is in the intersection with probability $\alpha.$

So, at time $t,$ (after $t$ samples from each of the lists) there are on average $M_t=\alpha t$ samples (we may need to take the integer part but ignore for now) which are from each of $A$ and $B$. Denote these sets $I_A(t)$ and $I_B(t)$ and they are subsets of $L_A(t)$ and $L_B(t)$ respectively.

Now, we need to calculate the probability that $L_A(t)$ and $L_B(t)$ have no overlap. We use independence of samples throughout. For a fixed element in $L_A(t)$ the probability that it is not present in $L_B(t)$ is $$ \left(1-\frac{1}{\alpha N}\right)^{M_t} $$ since we are targeting a single element of $\alpha N$ elements in the intersection and doing this via $M_t$ independent trials.

This is now repeated $M_t$ times for each element in $L_A(t)$, i.e., $M_t$ times.

So the probability that there is no overlap up to and including stage $t$ is $$ P_{fail}(t)=\left(1-\frac{1}{\alpha N}\right)^{M_t^2}= \left[\left(1-\frac{1}{\alpha N}\right)^{\alpha N} \right]^{M_t^2/\alpha N} \sim \exp\left(-M_t^2/\alpha N\right). $$ Recall that $M_t = \alpha t,$ so $P_{fail}(t)\sim \exp\left(-\alpha t^2/N\right).$ To make this probability of fail $1/e\approx 37\%$ we choose $t=\sqrt{N/\alpha}.$

By the way for your numbers (I don't think memory of $10^{80}$ is realistic by the way) this gives $t=\sqrt{10N/3}=\sqrt{10^{81}/3}\approx 1.83 \times 10^{40}.$

Edit: This is not a practical attack with given parameters. A zetabyte (EB) is $10^{21}$ bytes and is so large that all the data in the world is just a few zetabytes. As for adaptation of the attack, ven with some improvements it's unlikely to obtain a practical attack. Ignoring the big lists, just think of searching a collision in the overlap set. Birthday algorithm of sequential search is optimal in this case. You cannot get away from the required complexity being roughly $\sqrt{N\alpha}$ to find how the probability varies, reduce the number of samples to $M<M_t$ and plug into the first exponential and see how quickly probability of failure approaches $1$.

Thanks for the analysis kodlu. This analysis is insightful. One more question: Is this doable? $10^{40}$ is a humongous number. The storage will run out. Actually the source of question was a cryptographic algorithm, I am testing, of $256$ bits security. The output produced in such a scheme was shared with it's weak implementation with overlap ratio 0.3. That is why I was asking whether we can find a similar element in a specified range thus allowing us to compromise security — esteregg
– esteregg, Commented Dec 1, 2024 at 11:34

Stack Exchange Network

Minimum number of length of sublist so that they have one element in common

1 Answer 1

Hot Network Questions

Minimum number of length of sublist so that they have one element in common

1 Answer 1

Related

Hot Network Questions