I am aware that cross posting isn't typically allowed on SE but I think this question, earlier posted on math SE, is worth considering on crypto SE.
(Since this post is just a copy of math SE question I am modifying it to Cryoto SE)
I have two list, say list1 and list2, containing random integers and both having size $N$. We have a fixed overlap ratio between them. If overlap ratio is $0.3$ and $N = 10$ it means both list have around $3$ integers in common. However this doesn’t guarantee that the indices for these elements will be same. In most of the cases they will be different.
I want help regarding the minimum number of brute force searches I need to perform to find an element that occurs in both list.
Since we need to find a common element, which is guaranteed by the overlap ratio, We start by creating two list, say $A$ and $B$, and add element into them from list1 and list2 until we have: $|A \cap B| = 1$. That is both sub lists have at least one element in common.
The final goal: Minimum number of length of sublists (number of elements in them) so that both have at-least one element in common.
Here is how sublist are created: We pick a random index, $i$, in range $(0, N)$ and create two sub lists: sublist1 and sublist2.
Step 1: Send list1[i] element to sublist1
Step 2: Send list2[i] element to sublist2
Step 3: Check if sublist1 and sublist2 have common elements irrespective of their indices in sublist. Assume that this step is also very simple and completes instantly.
Step 4: If no common element is found increase $i$ with $1$ and repeat. If $i = N +1$ set $i = 1$ again.
I wrote a python program with overlap ratio $0.3$ and here are the results
N Avg Trials % of N √N Ratio (Trials/√N) 120 17 14.17% 11 1.55 12,000 177 1.47% 110 1.61 100,000 539 0.539% 316 1.70 1,000,000 1,584 0.1584% 1,000 1.58 10,000,000 4,975 0.04975% 3,162 1.57 $N$ denotes the length of list. Then I ran the code multiple times and after running it multiple times I calculated the average trials needed that is in Avg Trials. % of N indicates the percentage of range needed to search (average).
I am not good at probability. This suggests that it might be following the birthday bound. If someone could help me with more details and help generalise these results. I seek help for following:
- Given $N = 10^{80}$ and overlap ratio $=0.3$ what will be the minimum number of trials that ensure there is atleast one element in both sublist?
- What will be the cost of implementing this (as asked in 1) and whether it's feasible?
- Whether this matching follows a birthday bound, if yes, then why?
Example:
List1 = $[52579, 46780, 36228, 40369, 26694, 36744, 9651, 98806, 5305, 64566]$
List2 = $[46780, 56910, 80182, 63250, 53981, 5305, 76771, 9651, 61251, 76094]$
Let $i = 2$
Trial 1
sublist1 = $[46780]$
sublist2 = $[56910]$
Since there is no matching element we include next elements
Thus $i = 2+1 = 3$.
sublist1 = $[46780, 36228]$
sublist2 = $[56910, 80182]$
Again there is no matching element so we keep including more additional element.
At 7th trial the picture looks like
sublist1 = $[46780, 36228, 40369, 26694, 36744, 9651, 98806]$
sublist2 = $[56910, 80182, 63250, 53981, 5305, 76771, 9651]$
Here we have $9651$ in sublist2 which occur in sublist1. Hence here the length is $7$.