There are about 98,000 sentences (5 to 100 words long) in lst_train and about 1,000 sentences (5 to 100 words long) in lst_test. For each sentence in lst_test I want to determine whether it is plagiarized from a sentence in lst_train. If a test sentence is plagiarized, I want to return the id of the matching sentence in lst_train; otherwise, null.
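To illustrate the result format I'm after, here is a made-up toy example (the sentences and ids below are purely hypothetical, just for illustration):

    # Hypothetical toy data, only to show the expected result format.
    lst_train = ["the cat sat on the mat", "machine learning is fun"]  # id = index
    lst_test = ["the cat sat on a mat", "a completely unrelated sentence"]

    # Desired output: for each test sentence, the id of the train sentence it
    # was plagiarized from, or None (null) if there is no match.
    expected = [0, None]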
Now I want to compute the Jaccard similarity of each sentence in lst_test against every sentence in lst_train. Here's my code; b.JaccardSim computes the Jaccard similarity of two sentences:
    import textSimilarity

    lst_all_p = []
    for i in range(len(lst_test)):
        print('i:', i)
        lst_p = []
        for j in range(len(lst_train)):
            b = textSimilarity.TextSimilarity(lst_test[i], lst_train[j])
            lst_p.append(b.JaccardSim(b.str_a, b.str_b))
        lst_all_p.append(lst_p)

But I found that comparing one test sentence against every sentence in lst_train takes more than one minute. Since there are about 1,000 test sentences, the whole run would take roughly 1,000 minutes, which is far too long.
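For reference, I believe JaccardSim computes something equivalent to the sketch below (assuming whitespace tokenization of the two sentences into word sets; I'm not certain this is exactly what the library does internally):

    # Minimal sketch of what I assume JaccardSim computes:
    # |A intersect B| / |A union B| over the two sentences' word sets.
    def jaccard_sim(sent_a, sent_b):
        set_a = set(sent_a.lower().split())
        set_b = set(sent_b.lower().split())
        if not set_a or not set_b:
            return 0.0
        return len(set_a & set_b) / len(set_a | set_b)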
Does anyone know how to speed up this computation, or a better approach for detecting whether a test sentence is plagiarized from a sentence in lst_train?