25

I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?

This thread shows how to calculate the Jaccard Similarity between two strings, however I want to apply this to two lists, where each element is one word (e.g., a username).

10 Answers

40

I ended up writing my own solution after all:

    def jaccard_similarity(list1, list2):
        intersection = len(list(set(list1).intersection(list2)))
        union = (len(set(list1)) + len(set(list2))) - intersection
        return float(intersection) / union

5 Comments

The function will always return 0.0
@xyd Works perfectly for me. Can you please explain?
Worth noting this calculation is different than the answer by @w2bo as this one does not divide by the set length union.
This answer is wrong. For example, jaccard_similarity([1], [0, 1]) -> 0.5 and jaccard_similarity([1, 1], [0, 1, 1]) -> 0.25; however, the second pair should be at least as similar as the first, given how Jaccard similarity is defined.
The solution is simple and elegant, but not 100% correct. You should change the corresponding line to: union = (len(set(list1)) + len(set(list2))) - intersection
32

For Python 3:

    def jaccard_similarity(list1, list2):
        s1 = set(list1)
        s2 = set(list2)
        return float(len(s1.intersection(s2)) / len(s1.union(s2)))

    list1 = ['dog', 'cat', 'cat', 'rat']
    list2 = ['dog', 'cat', 'mouse']
    jaccard_similarity(list1, list2)  # 0.5

For Python 2, use return len(s1.intersection(s2)) / float(len(s1.union(s2))) instead, because / performs integer division on two ints there.

2 Comments

This will also give 0.0 as a result. The return statement should be modified to: return float(len(s1.intersection(s2))) / float(len(s1.union(s2)))
For Python2 use: return float(len(s1.intersection(s2))) / len(s1.union(s2))
14

@aventinus I don't have enough reputation to add a comment to your answer, but just to make things clearer, your solution measures the jaccard_similarity but the function is misnamed as jaccard_distance, which is actually 1 - jaccard_similarity
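
The relationship between the two measures can be sketched in a few lines. This is only a minimal illustration, assuming Python 3 and the set-based similarity used elsewhere in this thread; the example lists are made up for the demonstration.

```python
def jaccard_similarity(list1, list2):
    """Set-based Jaccard similarity: size of intersection / size of union."""
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)


def jaccard_distance(list1, list2):
    """Jaccard distance is the complement of the similarity."""
    return 1 - jaccard_similarity(list1, list2)


# Made-up example data: one shared element out of five distinct ones.
list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'mouse', 'bird']

print(jaccard_similarity(list1, list2))  # 0.2
print(jaccard_distance(list1, list2))    # 0.8
```

Identical inputs give a similarity of 1 and a distance of 0, which is a quick way to check which of the two a function actually returns.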

1 Comment

Thank you for the tip! I did not know that. I edited the answer accordingly.
7

Assuming your usernames don't repeat, you can use the same idea:

    def jaccard(a, b):
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))

    list1 = ['dog', 'cat', 'rat']
    list2 = ['dog', 'cat', 'mouse']
    # The intersection is ['dog', 'cat']
    # The union is ['dog', 'cat', 'rat', 'mouse']
    words1 = set(list1)
    words2 = set(list2)
    jaccard(words1, words2)  # 0.5

4

You can use the Distance library

    # pip install Distance
    import distance

    distance.jaccard("decide", "resize")
    # Returns 0.7142857142857143

1 Comment

This answer describes how to get the Jaccard similarity between two strings which is not what this question is about.
4

@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so in the denominator part it should also use sets (instead of lists). So for example jaccard_similarity('aa', 'ab') should result in 0.5.

    def jaccard_similarity(list1, list2):
        intersection = len(set(list1).intersection(list2))
        union = len(set(list1)) + len(set(list2)) - intersection
        return intersection / union

Note that in the intersection, there is no need to cast to list first. Also, the cast to float is not needed in Python 3.
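
To see why the denominator matters, here is a small sketch comparing a union computed from list lengths with one computed from sets. The name `jaccard_list_lengths` is hypothetical and only stands in for the variant being criticized; Python 3 division is assumed.

```python
def jaccard_list_lengths(list1, list2):
    # Hypothetical variant: the union size is derived from list lengths,
    # so duplicates inflate the denominator.
    intersection = len(set(list1).intersection(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union


def jaccard_similarity(list1, list2):
    # Set-based variant: duplicates are removed before counting.
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union


print(jaccard_list_lengths('aa', 'ab'))  # 1 / 3, duplicates counted
print(jaccard_similarity('aa', 'ab'))    # 0.5
```

The same effect shows up with lists: `[1, 1]` against `[0, 1, 1]` gives 0.25 with the length-based denominator and 0.5 with the set-based one.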

2

Creator of the Simphile NLP text similarity package here. Simphile contains several text similarity methods, Jaccard being one of them.

In the terminal install the package:

pip install simphile 

Then your code could be something like:

    from simphile import jaccard_list_similarity

    list_a = ['cat', 'cat', 'dog']
    list_b = ['dog', 'dog', 'cat']

    print(f"Jaccard Similarity: {jaccard_list_similarity(list_a, list_b)}")

The output being:

Jaccard Similarity: 0.5 

Note that this solution accounts for repeated elements -- critical for text similarity; without it, the above example would show 100% similarity due to the fact that both lists as sets would reduce to {'dog', 'cat'}.
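
For readers who want to see the effect of repeated elements without installing anything, one common way to account for them is the multiset (bag) form of the Jaccard index, sketched below with the standard library. This is an assumption-laden illustration, not a description of how Simphile computes its result; the function name `jaccard_multiset` is made up here.

```python
from collections import Counter


def jaccard_multiset(list1, list2):
    # Counter & Counter keeps the minimum count of each element,
    # Counter | Counter keeps the maximum count.
    c1, c2 = Counter(list1), Counter(list2)
    intersection = sum((c1 & c2).values())
    union = sum((c1 | c2).values())
    return intersection / union


list_a = ['cat', 'cat', 'dog']
list_b = ['dog', 'dog', 'cat']

# Multiset view: the repeated elements are counted.
print(jaccard_multiset(list_a, list_b))  # 0.5

# Plain set view: both lists collapse to {'cat', 'dog'}.
set_a, set_b = set(list_a), set(list_b)
print(len(set_a & set_b) / len(set_a | set_b))  # 1.0
```

For this particular pair of lists the multiset form also yields 0.5, while the plain set form yields 1.0.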

1

If you'd like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it's just an extended dict under the hood:

    from collections import Counter

    def jaccard_repeats(a, b):
        """Jaccard similarity measure between input iterables, allowing repeated elements"""
        _a = Counter(a)
        _b = Counter(b)
        c = (_a - _b) + (_b - _a)
        n = sum(c.values())
        return n/(len(a) + len(b) - n)

    list1 = ['dog', 'cat', 'rat', 'cat']
    list2 = ['dog', 'cat', 'rat']
    list3 = ['dog', 'cat', 'mouse']

    jaccard_repeats(list1, list3)  # 0.75
    jaccard_repeats(list1, list2)  # 0.16666666666666666
    jaccard_repeats(list2, list3)  # 0.5

2 Comments

I think this solution is not correct as regards repeated items. However, it works ok for lists with non-repeated items.
I think that this is a distance, so if one wants similarity, '1 - ' should be removed from the return line.
1

To avoid counting repeated elements in the union (the denominator), and to be a little faster, I propose:

    def Jaccard_score(lista_1, lista_2):
        # Parameter names now match the names used in the body.
        inter = len(list(set(lista_1) & set(lista_2)))
        union = len(list(set(lista_1) | set(lista_2)))
        return inter/union

1

⚠️ Attention:

  1. The Jaccard index (or Jaccard similarity coefficient) is a similarity measure of sets (unordered collections of unique elements) and not of lists (ordered collections of elements)! That means using the Jaccard index as the question suggests can lead to wrong and misleading results, since each list will be interpreted as a set (discarding order and duplicates).

  2. The Jaccard index is not defined when both sets are empty!

The Jaccard index is defined as the size of the intersection divided by the size of the union of two sets.

If you would like to use the Jaccard index for a different purpose, I would implement it exactly following the definition:

    def jaccard_index(s_1: set, s_2: set):
        return len(s_1 & s_2) / len(s_1 | s_2)

This implementation raises a ZeroDivisionError when the union is empty (i.e., both sets are empty).
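
If an exception is not wanted in that case, one option is to guard the division explicitly. The sketch below is only an illustration: the name `jaccard_index_safe` and the `empty_value` parameter are hypothetical, and returning 1.0 for two empty sets is a convention chosen here, not part of the definition.

```python
def jaccard_index_safe(s_1: set, s_2: set, empty_value: float = 1.0) -> float:
    union = s_1 | s_2
    if not union:
        # Both sets are empty, so the index is undefined (0 / 0).
        # empty_value is an assumed convention; pick what suits the use case.
        return empty_value
    return len(s_1 & s_2) / len(union)


print(jaccard_index_safe(set(), set()))                   # 1.0 (by convention)
print(jaccard_index_safe({'dog'}, set()))                 # 0.0
print(jaccard_index_safe({'dog', 'cat'}, {'dog', 'rat'}))
```

Only the case where both sets are empty needs the guard; a single empty set gives a well-defined index of 0.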

If you are looking for a string similarity measure, you might need other measures such as the Hamming distance, the Levenshtein distance, or, more generally, an edit distance.
