
I am currently working on a problem of finding the best matches for the elements of one list, "ListA", in another list, "ListB". Whenever an element of ListA matches an element of ListB with a confidence of 70% or greater, I add the matched string from ListB and the string from ListA to a tuple, which I then save in a database.

The Levenshtein algorithm gives me the distance between a word in ListA and the words in ListB. I use that distance to calculate the similarity between the words and compare it against my threshold of 70%; any pair whose similarity meets or exceeds the threshold is added to the tuple.
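For reference, the distance-to-similarity conversion described here can be sketched like this (a minimal standalone example with a plain Levenshtein implementation, not the extension method from the question):

```csharp
using System;

public static class FuzzySimilarity
{
    // Plain Levenshtein distance (insertions, deletions, substitutions).
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Similarity as described in the question: 1 - distance / max length.
    public static double Similarity(string a, string b)
    {
        if (a.Length == 0 && b.Length == 0) return 1.0;
        int dist = Levenshtein(a, b);
        return 1.0 - (double)dist / Math.Max(a.Length, b.Length);
    }
}
```

For example, "kitten" vs. "sitting" has distance 3 and max length 7, giving a similarity of about 0.57, below the 70% threshold.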

The code I have written for this process works fine when "ListA" and "ListB" contain a few thousand records, but when I increase the record count to a million it takes about an hour to calculate the distances for each element of ListA.

I need to optimize the process for huge data sets. Please advise where I need to make improvements.

My code for the process so far looks like this:

    public static async Task PerformFuzzyMatch()
    {
        var exceptions = new ConcurrentQueue<Exception>();

        // Fetch ListA & ListB from SQL tables
        var ListACount = await FuzzyMatchRepo.FetchListACount();
        var ListB = await FuzzyMatchRepo.FetchListBAsync();

        // Split the ListA data into smaller chunks and loop through those chunks
        var splitGroupSize = 1000;
        var sourceDataBatchesCount = ListACount / splitGroupSize;

        // Loop through the smaller chunks of ListA
        for (int b = 0; b < sourceDataBatchesCount; b++)
        {
            var currentBatchMatchedWords = new List<Tuple<string, string, double>>();
            int skipRowCount = b * splitGroupSize;
            int takeRowCount = splitGroupSize;

            // Get a chunk of ListA according to skipRowCount and takeRowCount
            var currentSourceDataBatch = FuzzyMatchRepository.FetchSourceDataBatch(skipRowCount, takeRowCount);

            // Loop through ListB and calculate the distance between the
            // chunk of ListA and the ListB data in parallel
            for (int i = 0; i < ListB.Count; i++)
            {
                Parallel.For(
                    0,
                    currentSourceDataBatch.Count,
                    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
                    cntr =>
                    {
                        try
                        {
                            // Call the Levenshtein algorithm to calculate the distance between
                            // each element of ListB and the smaller chunk of ListA
                            int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
                            int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
                            double similarity = 1.0 - (double)leven / length;
                            if ((similarity * 100) >= 70)
                            {
                                currentBatchMatchedWords.Add(Tuple.Create(currentSourceDataBatch[cntr], ListB[i], similarity));
                            }
                        }
                        catch (Exception ex)
                        {
                            exceptions.Enqueue(ex);
                        }
                    });
            }
        }
    }

And the algorithm it calls to calculate the distance is:

    public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
    {
        if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
        {
            return -1;
        }

        if (!caseSensitive)
        {
            input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
            comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
        }

        int inputLen = input.Length;
        int comparedToLen = comparedTo.Length;
        int[,] matrix = new int[inputLen, comparedToLen];

        //initialize
        for (var i = 0; i < inputLen; i++) { matrix[i, 0] = i; }
        for (var i = 0; i < comparedToLen; i++) { matrix[0, i] = i; }

        //analyze
        for (var i = 1; i < inputLen; i++)
        {
            ushort si = input[i - 1];
            for (var j = 1; j < comparedToLen; j++)
            {
                ushort tj = comparedTo[j - 1];
                int cost = (si == tj) ? 0 : 1;
                int above = matrix[i - 1, j];
                int left = matrix[i, j - 1];
                int diag = matrix[i - 1, j - 1];
                int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

                //transposition
                if (i > 1 && j > 1)
                {
                    int trans = matrix[i - 2, j - 2] + 1;
                    if (input[i - 2] != comparedTo[j - 1]) { trans++; }
                    if (input[i - 1] != comparedTo[j - 2]) { trans++; }
                    if (cell > trans) { cell = trans; }
                }

                matrix[i, j] = cell;
            }
        }

        return matrix[inputLen - 1, comparedToLen - 1];
    }

Find Minimum Optimized:

    public static int FindMinimumOptimized(int a, int b, int c)
    {
        return Math.Min(a, Math.Min(b, c));
    }
  • Please post the complete and possibly working code and example usage with sample data. Do not omit anything like // save the data in tuple.. It makes reviewing it impossible. Also, how can we suggest anything if you do not even post the LevenshteinDistance... unless it's recursive and you didn't post the signature... well, just post everything and some examples or unit tests if you have any. Commented Feb 4, 2019 at 7:53
  • Thanks @t3chb0t, I have made the edits in the question. Commented Feb 4, 2019 at 8:49
  • Great! I flipped my vote ;-) Commented Feb 4, 2019 at 8:51
  • @Shahid: thanks. It looks like there's a problem with your Levenshtein implementation though: the distance between "a" and "" should be 1, not -1, and between "abc" and "def" should be 3, not 2. Also, why are you interpreting the Levenshtein distance as a similarity percentage? It's the number of differences, not a similarity rating... Commented Feb 4, 2019 at 10:14
  • If '70% accurate' translates to 'a maximum edit distance of 3 per 10 characters', then you can modify your Levenshtein method to bail out as soon as it knows the edit distance will be larger than allowed for the given words. Bailing out if the word lengths differ too much could also be a useful optimization (depending on the actual data). Commented Feb 6, 2019 at 12:22

1 Answer


(I'm tempted to comment on program development in general, or on getting more out of CodeReview@SE, as opposed to reviewing the code.)

  • Document your code. In the code. Use what tool support you can get; for "the C-family", I use & recommend doxygen.
    • Document the external/public interface:
    What is the purpose of PerformFuzzyMatch()? All I can see is the locally declared currentBatchMatchedWords growing.
    What are FuzzyMatchRepo and FuzzyMatchRepository?
    • Comment why everything non-trivial is there.
  • going parallel seems the way to go; depending on the approach, it may prove advantageous to estimate whether exchanging the roles of list A and B promises an improvement.
    The fixed group/batch size looks debatable.
  • the fixed similarity limit looks inflexible -
    use a function as a parameter?
  • computing the exact distance when you are only interested in similar enough is wasteful, as commented by Pieter Witvoet
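Passing the acceptance test in as a parameter, as suggested above, could look like this (a hypothetical sketch; the method and parameter names are illustrative, not from the question's code):

```csharp
using System;
using System.Collections.Generic;

public static class MatchFilter
{
    // Instead of a hard-coded 70% limit, take both the similarity measure
    // and the acceptance test as parameters.
    public static List<(string, string, double)> Match(
        IEnumerable<string> listA,
        IEnumerable<string> listB,
        Func<string, string, double> similarity,
        Func<double, bool> accept)
    {
        var results = new List<(string, string, double)>();
        foreach (var a in listA)
            foreach (var b in listB)
            {
                double s = similarity(a, b);
                if (accept(s)) results.Add((a, b, s));
            }
        return results;
    }
}
```

A caller can then write `MatchFilter.Match(listA, listB, MySimilarity, s => s >= 0.70)` and change the threshold, or the whole acceptance rule, without touching the matching loop.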

To find solutions to evaluate, I find it useful to collect what seems helpful, with an eye on what in the problem at hand differs from well known problems:

  • sorting the lists by length&contents allows
    • weeding out words occurring more than once
    • establishing bounds on similarity
    • reusing partial results in "dynamic computation"
  • character histograms help establish bounds on distance:
    half the sum of absolute differences of the character counts is a lower bound (a substitution changes two counts, an insertion or deletion one)
  • the simplest part carrying information about relative position is the digram/(ordered) pair of characters
  • the triangle inequality holds for the edit distances counting replacements (Hamming), insertions&deletions (Levenshtein) and unrestricted neighbour transpositions (unrestricted Damerau-Levenshtein - still not sure whether there is a generally accepted nomenclature)
    This is related to what "the Wei/Chuan/Xuemin/Chengqi paper" seems to exploit (from skimming it)
  • "all about a string" is in its suffix array
    Then, there is the FM-index (Full-text index in Minute space)
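The histogram bound from the list above can be sketched as follows (a minimal sketch; since a substitution changes two character counts while an insertion or deletion changes one, half the sum of absolute count differences, rounded up, is the safe lower bound):

```csharp
using System;

public static class HistogramBound
{
    // Lower bound on the Levenshtein distance from character counts alone.
    // Each insertion/deletion changes one count by 1; each substitution
    // changes two counts by 1 each, so every edit changes the accumulated
    // absolute difference by at most 2: distance >= ceil(sum / 2).
    public static int LowerBound(string a, string b)
    {
        var counts = new int[char.MaxValue + 1];
        foreach (char c in a) counts[c]++;
        foreach (char c in b) counts[c]--;

        int sum = 0;
        foreach (int n in counts) sum += Math.Abs(n);
        return (sum + 1) / 2;   // ceiling division by 2
    }
}
```

The bound is cheap (one pass over each string) but can be far below the true distance: for "ab" vs. "ba" it is 0 while the Levenshtein distance is 2, so it only serves to reject clearly dissimilar pairs early.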

  • LevenshteinDistance() seems to compute the restricted Damerau-Levenshtein distance
    using offsets into the strings of -1 and, for transpositions, -2, which looks unusual
    ignoring the last characters, which looks erroneous (if consequential)
  • it allocates an array with size the product of the lengths involved.
    Wagner-Fischer reduces this to two rows for Levenshtein; it should be easy to extend this to one more row for Damerau-Levenshtein
    reducing this by one row has been done for Levenshtein
  • given an upper bound on distance (like length of longer string), don't compute entries that exceed that bound (Ukkonen's optimisation - en.wikipedia mentions single-row implementation but not this one?!).
  • your code follows the mathematical presentation of the measure closely, missing the chance to bail out early:
    if the characters match, or the current character is the 2nd of a transposition, the distance does not increase, and no other edit costs less
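The two-row Wagner-Fischer reduction mentioned above can be sketched like this for plain Levenshtein (a sketch without the transposition case; extending it would need one extra row, as noted):

```csharp
using System;

public static class TwoRowLevenshtein
{
    // Wagner-Fischer with two rows instead of a full matrix:
    // each row of the DP table depends only on the previous row,
    // so memory drops from O(n*m) to O(m).
    public static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1,      // deletion
                                            curr[j - 1] + 1), // insertion
                                   prev[j - 1] + cost);       // substitution
            }
            (prev, curr) = (curr, prev);   // reuse the buffers
        }
        return prev[b.Length];
    }
}
```

Besides the memory saving, the row-at-a-time shape is what makes Ukkonen-style banding straightforward: with an upper bound k on the distance, only the cells within k of the diagonal need to be filled in each row.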
  • I actually managed to get the run times down by a major factor using this: github.com/wolfgarbe/SymSpell Commented Feb 17, 2020 at 7:19
  • Update the question! Or, maybe, post your own answer. Commented Feb 17, 2020 at 7:19
  • Digram counts tell something about order and proximity, but they seem harder to "search". While half the sum of absolute differences of character counts sets a lower bound for the distance ld, I'm not there with digram counts - 4ld < accumulatedDifferences? Commented Feb 18, 2020 at 9:05
