6

There are about 98,000 sentences (5-100 words each) in lst_train and about 1,000 sentences (also 5-100 words) in lst_test. For each sentence in lst_test I want to determine whether it is plagiarized from a sentence in lst_train. If it is, I should return that sentence's id in lst_train, otherwise null.

Now I want to compute the Jaccard similarity of each sentence in lst_test against each sentence in lst_train. Here's my code; b.JaccardSim computes the Jaccard similarity of two sentences:

lst_all_p = []
for i in range(len(lst_test)):
    print('i:', i)
    lst_p = []
    for j in range(len(lst_train)):
        b = textSimilarity.TextSimilarity(lst_test[i], lst_train[j])
        lst_p.append(b.JaccardSim(b.str_a, b.str_b))
    lst_all_p.append(lst_p)

But I found that comparing one test sentence against every sentence in lst_train takes more than a minute. Since there are about 1,000 test sentences, the whole job would take about 1,000 minutes, which is far too long.

Does anyone know how to speed up the computation, or a better method for detecting whether a sentence is plagiarized from one in lst_train?

2
  • Text-based similarity metrics are very slow. Try vectorizing the text data and using the cosine similarity metric, or methods based on it. Commented Jun 13, 2018 at 12:48
  • hi @CrazyElf, thanks for your comment. I am now using word2vec to do it. Commented Jun 14, 2018 at 8:00

2 Answers

6

It's probably better to change your approach. Jaccard similarity isn't especially computationally intensive, but if you have to compute it for every pair of elements in your dataset, any non-trivial similarity calculation will be slow.

If you want to find plagiarism you should look into near-duplicate detection and locality-sensitive hashing. MinHash and SimHash are good starting points, and the datasketch library may also be helpful.

Note that for many applications, vectorizing your sentences and searching for close sentences in vector space is effective. However, that works precisely because it understands synonyms; since you're looking for plagiarism, i.e. near-exact copies, word-vector-based methods could end up working against your goal.


1 Comment

thanks for your comments, I have sent you an email, please check it.
1

As suggested by polm23, the datasketch library is a good fit for your task.

This GitHub repo shows an example where the LSH implementation provided by datasketch is used to compute Jaccard similarity between text documents.
