6

There are about 98,000 sentences (5-100 words each) in lst_train and about 1,000 sentences (also 5-100 words) in lst_test. For each sentence in lst_test I want to determine whether it is plagiarized from a sentence in lst_train. If it is, I should return that sentence's id in lst_train, otherwise null.

Now I want to compute the Jaccard similarity of each sentence in lst_test against each sentence in lst_train. Here's my code; b.JaccardSim computes the Jaccard similarity of two sentences:

lst_all_p = []
for i in range(len(lst_test)):
    print('i:', i)
    lst_p = []
    for j in range(len(lst_train)):
        b = textSimilarity.TextSimilarity(lst_test[i], lst_train[j])
        lst_p.append(b.JaccardSim(b.str_a, b.str_b))
    lst_all_p.append(lst_p)

But I found that comparing one test sentence against every sentence in lst_train takes more than a minute. Since there are about 1,000 test sentences, the whole job would take about 1,000 minutes, which is far too long.

Does anyone know how to speed up the computation, or a better method for detecting whether a sentence is plagiarized from one in lst_train?

2
  • Text-based similarity metrics are very slow. Try vectorizing the text data and using the cosine similarity metric, or methods based on it. Commented Jun 13, 2018 at 12:48
  • hi @CrazyElf, thanks for your comment. I am now using word2vec to do it. Commented Jun 14, 2018 at 8:00

2 Answers

6

It's probably better to change your approach. Jaccard similarity isn't especially computationally intensive, but if you have to compute it for every pair of elements in your dataset, any non-trivial similarity calculation will be slow.

If you want to find plagiarism you should look into near-duplicate detection and locality-sensitive hashing. MinHash and SimHash are good starting points, and the datasketch library may also be helpful.

Note that for many applications, vectorizing your sentences and searching for close sentences in vector space is effective. However, that works precisely because it understands synonyms; since you're looking for plagiarism, i.e. near-exact copies, word-vector-based methods could end up working against your goal.


1 Comment

thanks for your comments, I have sent you an email, please check it.
1

As suggested by polm23, the datasketch library is a good fit for your task.

This GitHub repo shows an example where the LSH implementation provided by datasketch is used to compute Jaccard similarity between text documents.
