I am currently working on a problem of finding the best matches for the elements of one list, "ListA", in another list, "ListB". Whenever an element of "ListA" matches an element of "ListB" with a confidence of 70% or greater, I add the matched string from ListB and the string from ListA to a tuple, which I then save to a database.
The Levenshtein algorithm gives me the distance between a word in ListA and a word in ListB. I use that distance to calculate the similarity between the two words and compare it against my 70% threshold; if the similarity is equal to or greater than 70%, I add the result to the tuple.
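In other words, the similarity I compute from the distance looks like this (a minimal sketch; the helper name is made up for illustration and is not part of my code):

```csharp
using System;

// Sketch of the similarity measure described above, given a precomputed
// Levenshtein distance: 1 - distance / max(len(a), len(b)).
static double Similarity(string a, string b, int levenshteinDistance)
{
    int length = Math.Max(a.Length, b.Length);
    return 1.0 - (double)levenshteinDistance / length;
}

// "kitten" -> "sitting" has Levenshtein distance 3 and max length 7,
// so similarity = 1 - 3/7 ≈ 0.571, which is below the 70% threshold.
Console.WriteLine(Similarity("kitten", "sitting", 3) * 100 >= 70); // False
```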
The code I have written for this process works fine as long as "ListA" and "ListB" contain only thousands of records, but when I increase the record count to a million it takes about an hour to calculate the distances for the elements of ListA.
I need to optimize the process for huge data sets. Please advise where I need to make improvements.

My code for the process so far looks like this:
```csharp
// Assumed to live in a class with a ConcurrentQueue<Exception> field named 'exceptions'.
// Declared async Task because the method awaits the repository calls.
public static async Task PerformFuzzyMatchAsync()
{
    // Fetch ListA's row count & ListB from the SQL tables
    var ListACount = await FuzzyMatchRepo.FetchListACount();
    var ListB = await FuzzyMatchRepo.FetchListBAsync();

    // Split the ListA data into smaller chunks and loop through those chunks.
    // Round up so the final partial batch is not lost to integer division.
    var splitGroupSize = 1000;
    var sourceDataBatchesCount = (ListACount + splitGroupSize - 1) / splitGroupSize;

    // Loop through the smaller chunks of ListA
    for (int b = 0; b < sourceDataBatchesCount; b++)
    {
        // ConcurrentBag, because List<T>.Add is not thread-safe under Parallel.For
        var currentBatchMatchedWords = new ConcurrentBag<Tuple<string, string, double>>();
        int skipRowCount = b * splitGroupSize;
        int takeRowCount = splitGroupSize;

        // Get a chunk of ListA according to skipRowCount and takeRowCount
        var currentSourceDataBatch = FuzzyMatchRepo.FetchSourceDataBatch(skipRowCount, takeRowCount);

        // Loop through ListB and calculate the distance between its elements
        // and the current chunk of ListA in parallel
        for (int i = 0; i < ListB.Count; i++)
        {
            Parallel.For(
                0,
                currentSourceDataBatch.Count,
                new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 },
                cntr =>
                {
                    try
                    {
                        // Calculate the distance between this element of ListB
                        // and one element of the current chunk of ListA
                        int leven = LevenshteinDistance(currentSourceDataBatch[cntr], ListB[i]);
                        int length = Math.Max(currentSourceDataBatch[cntr].Length, ListB[i].Length);
                        double similarity = 1.0 - (double)leven / length;

                        if ((similarity * 100) >= 70)
                        {
                            currentBatchMatchedWords.Add(
                                Tuple.Create(currentSourceDataBatch[cntr], ListB[i], similarity));
                        }
                    }
                    catch (Exception ex)
                    {
                        exceptions.Enqueue(ex);
                    }
                });
        }
    }
}
```

And the algorithm it calls to calculate the distance is:
```csharp
public static int LevenshteinDistance(this string input, string comparedTo, bool caseSensitive = false)
{
    if (string.IsNullOrWhiteSpace(input) || string.IsNullOrWhiteSpace(comparedTo))
    {
        return -1;
    }

    if (!caseSensitive)
    {
        input = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(input);
        comparedTo = Common.Hashing.InvariantUpperCaseStringExtensions.ToUpperInvariant(comparedTo);
    }

    int inputLen = input.Length;
    int comparedToLen = comparedTo.Length;
    int[,] matrix = new int[inputLen, comparedToLen];

    // initialize
    for (var i = 0; i < inputLen; i++) { matrix[i, 0] = i; }
    for (var i = 0; i < comparedToLen; i++) { matrix[0, i] = i; }

    // analyze
    for (var i = 1; i < inputLen; i++)
    {
        ushort si = input[i - 1];
        for (var j = 1; j < comparedToLen; j++)
        {
            ushort tj = comparedTo[j - 1];
            int cost = (si == tj) ? 0 : 1;
            int above = matrix[i - 1, j];
            int left = matrix[i, j - 1];
            int diag = matrix[i - 1, j - 1];
            int cell = FindMinimumOptimized(above + 1, left + 1, diag + cost);

            // transposition
            if (i > 1 && j > 1)
            {
                int trans = matrix[i - 2, j - 2] + 1;
                if (input[i - 2] != comparedTo[j - 1]) { trans++; }
                if (input[i - 1] != comparedTo[j - 2]) { trans++; }
                if (cell > trans) { cell = trans; }
            }

            matrix[i, j] = cell;
        }
    }

    return matrix[inputLen - 1, comparedToLen - 1];
}
```

Find Minimum Optimized:
```csharp
public static int FindMinimumOptimized(int a, int b, int c)
{
    return Math.Min(a, Math.Min(b, c));
}
```
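For reference, the textbook dynamic-programming version of plain Levenshtein distance (without the transposition step) uses a (length + 1) × (length + 1) matrix and loops running to the lengths inclusive, which is what the code above deviates from:

```csharp
using System;

// Minimal textbook Levenshtein distance, shown only for comparison with
// the posted implementation. Note the (s.Length + 1) x (t.Length + 1)
// matrix and the inclusive loop bounds.
static int Levenshtein(string s, string t)
{
    var d = new int[s.Length + 1, t.Length + 1];
    for (int i = 0; i <= s.Length; i++) d[i, 0] = i; // i deletions
    for (int j = 0; j <= t.Length; j++) d[0, j] = j; // j insertions

    for (int i = 1; i <= s.Length; i++)
    {
        for (int j = 1; j <= t.Length; j++)
        {
            int cost = s[i - 1] == t[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,   // deletion
                                        d[i, j - 1] + 1),  // insertion
                               d[i - 1, j - 1] + cost);    // substitution
        }
    }
    return d[s.Length, t.Length];
}

Console.WriteLine(Levenshtein("a", ""));      // 1
Console.WriteLine(Levenshtein("abc", "def")); // 3
```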
Comments:

> LevenshteinDistance... unless it's recursive and you didn't post the signature... well, just post everything, and some examples or unit tests if you have any.

> The distance between "a" and "" should be 1, not -1, and between "abc" and "def" should be 3, not 2. Also, why are you interpreting the Levenshtein distance as a similarity percentage? It's the number of differences, not a similarity rating...