Compare similarities in a list using hashes?

Question

Say you have a list:

L1 = [milk, butter, bread, shampoo, dog food]

and you want to know how similar this list is to another list

L2 = [milk, butter, shampoo, dog food, coffee]

That is get the union of two lists:

Result = L1 U L2

and Result is

[Milk, butter, dog food]

Now, I know I can iterate over these and find the union. But given a list of size m and a list of size n. You will iterate this at least min(n, m) times. Given x lists, you have x^min(n,m) iterations which can get pricy.

I was thinking hashes could be the way, but I am not sure.

But if there was a way to minimize a list down to one string and compare it to another string.

that is H(L1) U H(L2) has x% in common?

Note that I don't actually need to know what were the items in common. Just that they have a percentage shared between the two.

Have you checked on the set builtin? It's probably the data type for what you're after. — g.d.d.c
– g.d.d.c, Commented Apr 8, 2014 at 1:43

lvc · Accepted Answer · 2014-04-08 01:54:34Z

If you don't have duplicates in the two lists, you could use sets instead, which do use hashes internally -

>>> L1 = {'milk', 'butter', 'bread', 'shampoo', 'dog food'} >>> L2 = {'milk', 'butter', 'shampoo', 'dog food', 'coffee'} >>> L1 & L2 {'dog food', 'butter', 'shampoo', 'milk'}

If you do need to handle duplicates, Python has a multiset in the form of collections.Counter, and its intersection operation does what you would expect:

>>> from collections import Counter >>> Counter(L1) & Counter(L2) Counter({'butter': 1, 'milk': 1, 'shampoo': 1, 'dog food': 1})

To get the 'x% in common' string, you would need to compare the total number of elements in the intersection to the number of elements you started with. Sets support len() in the same way lists do, so getting the number of items in common if you don't have duplicates is just len(L1 & L2). Taking the length of a Counter will only give you the number of distinct elements - to get the number of elements counted up to their multiplicity when L1 and L2 are Counters, you could do:

 common = L1 & L2 num_in_common = sum(common.values())

John La Rooy · Accepted Answer · 2014-04-08 01:44:03Z

That's exactly how sets work. Convert the lists to sets and you can take the union/intersection

S1 = set(L1) S2 = set(L2) result = S1.intersection(S2)

This won't preserve the order though.

Collectives™ on Stack Overflow

Compare similarities in a list using hashes?

2 Answers 2

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Related