Skip to main content

Minimum number of length of sublist so that they have one element in common

I am aware that cross posting isn't typically allowed on SE but I think this question, earlier posted on math SE, is worth considering on crypto SE.

(Since this post is just a copy of math SE question I am modifying it to Cryoto SE)

I have two list, say list1 and list2, containing random integers and both having size $N$. We have a fixed overlap ratio between them. If overlap ratio is $0.3$ and $N = 10$ it means both list have around $3$ integers in common. However this doesn’t guarantee that the indices for these elements will be same. In most of the cases they will be different.

I want help regarding the minimum number of brute force searches I need to perform to find an element that occurs in both list.

Since we need to find a common element, which is guaranteed by the overlap ratio, We start by creating two list, say $A$ and $B$, and add element into them from list1 and list2 until we have: $|A \cap B| = 1$. That is both sub lists have at least one element in common.

The final goal: Minimum number of length of sublists (number of elements in them) so that both have at-least one element in common.

Here is how sublist are created: We pick a random index, $i$, in range $(0, N)$ and create two sub lists: sublist1 and sublist2.

Step 1: Send list1[i] element to sublist1

Step 2: Send list2[i] element to sublist2

Step 3: Check if sublist1 and sublist2 have common elements irrespective of their indices in sublist. Assume that this step is also very simple and completes instantly.

Step 4: If no common element is found increase $i$ with $1$ and repeat. If $i = N +1$ set $i = 1$ again.

I wrote a python program with overlap ratio $0.3$ and here are the results

N Avg Trials % of N √N Ratio (Trials/√N) 120 17 14.17% 11 1.55 12,000 177 1.47% 110 1.61 100,000 539 0.539% 316 1.70 1,000,000 1,584 0.1584% 1,000 1.58 10,000,000 4,975 0.04975% 3,162 1.57 

$N$ denotes the length of list. Then I ran the code multiple times and after running it multiple times I calculated the average trials needed that is in Avg Trials. % of N indicates the percentage of range needed to search (average).

I am not good at probability. This suggests that it might be following the birthday bound. If someone could help me with more details and help generalise these results. I seek help for following:

  1. Given $N = 10^{80}$ and overlap ratio $=0.3$ what will be the minimum number of trials that ensure there is atleast one element in both sublist?
  2. What will be the cost of implementing this (as asked in 1) and whether it's feasible?
  3. Whether this matching follows a birthday bound, if yes, then why?

Example:

List1 = $[52579, 46780, 36228, 40369, 26694, 36744, 9651, 98806, 5305, 64566]$

List2 = $[46780, 56910, 80182, 63250, 53981, 5305, 76771, 9651, 61251, 76094]$

Let $i = 2$

Trial 1

sublist1 = $[46780]$

sublist2 = $[56910]$

Since there is no matching element we include next elements

Thus $i = 2+1 = 3$.

sublist1 = $[46780, 36228]$

sublist2 = $[56910, 80182]$

Again there is no matching element so we keep including more additional element.

At 7th trial the picture looks like

sublist1 = $[46780, 36228, 40369, 26694, 36744, 9651, 98806]$

sublist2 = $[56910, 80182, 63250, 53981, 5305, 76771, 9651]$

Here we have $9651$ in sublist2 which occur in sublist1. Hence here the length is $7$.