1

Say you have a list:

L1 = [milk, butter, bread, shampoo, dog food] 

and you want to know how similar this list is to another list

L2 = [milk, butter, shampoo, dog food, coffee] 

That is get the union of two lists:

Result = L1 U L2 

and Result is

[Milk, butter, dog food] 

Now, I know I can iterate over these and find the union. But given a list of size m and a list of size n. You will iterate this at least min(n, m) times. Given x lists, you have x^min(n,m) iterations which can get pricy.

I was thinking hashes could be the way, but I am not sure.

But if there was a way to minimize a list down to one string and compare it to another string.

that is H(L1) U H(L2) has x% in common?

Note that I don't actually need to know what were the items in common. Just that they have a percentage shared between the two.

3
  • 2
    Have you checked on the set builtin? It's probably the data type for what you're after. Commented Apr 8, 2014 at 1:43
  • 3
    You say Union, but your result is the intersection Commented Apr 8, 2014 at 1:44
  • 1
    What does this have to do with cryptography? Commented Apr 8, 2014 at 1:45

2 Answers 2

1

If you don't have duplicates in the two lists, you could use sets instead, which do use hashes internally -

>>> L1 = {'milk', 'butter', 'bread', 'shampoo', 'dog food'} >>> L2 = {'milk', 'butter', 'shampoo', 'dog food', 'coffee'} >>> L1 & L2 {'dog food', 'butter', 'shampoo', 'milk'} 

If you do need to handle duplicates, Python has a multiset in the form of collections.Counter, and its intersection operation does what you would expect:

>>> from collections import Counter >>> Counter(L1) & Counter(L2) Counter({'butter': 1, 'milk': 1, 'shampoo': 1, 'dog food': 1}) 

To get the 'x% in common' string, you would need to compare the total number of elements in the intersection to the number of elements you started with. Sets support len() in the same way lists do, so getting the number of items in common if you don't have duplicates is just len(L1 & L2). Taking the length of a Counter will only give you the number of distinct elements - to get the number of elements counted up to their multiplicity when L1 and L2 are Counters, you could do:

 common = L1 & L2 num_in_common = sum(common.values()) 
Sign up to request clarification or add additional context in comments.

Comments

0

That's exactly how sets work. Convert the lists to sets and you can take the union/intersection

S1 = set(L1) S2 = set(L2) result = S1.intersection(S2) 

This won't preserve the order though.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.