(I'm tempted to comment on program development in general, or on getting more out of CodeReview@SE, rather than reviewing the code itself.)
- Document your code. In the code. Use what tool support you can get; for "the C-family", I use & recommend doxygen.
• Document the external/public interface:
What is the purpose of `PerformFuzzyMatch()`? All I can see is the locally declared `currentBatchMatchedWords` growing.
What are `FuzzyMatchRepo` and `FuzzyMatchRepository`?
• Comment why everything non-trivial is there.
- Going parallel seems the way to go.
- Depending on approach, it may prove advantageous to estimate whether exchanging the roles of list A and B promises an improvement.
- The fixed group/batch size looks debatable.
- The fixed similarity limit looks inflexible: use a function as a parameter?
- Computing the exact distance when you are only interested in "similar enough" is wasteful, as commented by Pieter Witvoet.
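To illustrate the "function as a parameter" idea, here is a minimal sketch in Python (the names `fuzzy_matches` and `is_close_enough` are mine for illustration, not from the reviewed code):

```python
def fuzzy_matches(words_a, words_b, distance, is_close_enough):
    """Yield (a, b, distance) for every pair the caller's predicate accepts.

    Both `distance` and `is_close_enough` are parameters, so the caller
    decides the measure and what counts as "similar enough".
    """
    for a in words_a:
        for b in words_b:
            d = distance(a, b)
            if is_close_enough(a, b, d):
                yield a, b, d
```

A caller could then scale the limit with word length, e.g. `lambda a, b, d: d <= max(len(a), len(b)) // 4`, instead of hard-coding one number.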
To find solutions to evaluate, I find it useful to collect what seems helpful, with an eye on how the problem at hand differs from well-known problems:
- sorting the lists by length & contents allows
• weeding out words occurring more than once
• establishing bounds on similarity
• reusing partial results in "dynamic programming"
- Character histograms help establish bounds on distance:
half the sum of absolute differences of the character counts, rounded up, is a lower bound (a substitution changes two counts, an insertion or deletion one).
- The simplest unit carrying information about relative position is the digram/(ordered) pair of characters.
- The triangle inequality holds for the edit distances counting replacements (Hamming), insertions & deletions (Levenshtein), and unrestricted neighbour transpositions (unrestricted Damerau-Levenshtein; I'm still not sure whether there is a generally accepted nomenclature).
This is related to what "the Wei/Chuan/Xuemin/Chengqi paper" seems to exploit (from skimming it).
- "All about a string" is in its suffix array.
Then, there is the FM-index ("Full-text index in Minute space").
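The histogram bound mentioned above can be sketched in Python (my code, just to make the bound concrete):

```python
from collections import Counter

def histogram_lower_bound(a, b):
    """Lower bound on the edit distance from character histograms.

    A substitution changes two character counts by one each; an insertion
    or deletion changes one count.  So the L1 distance between the two
    histograms is at most twice the edit distance, making half of it
    (rounded up) a valid lower bound.
    """
    diff = Counter(a)
    diff.subtract(Counter(b))
    l1 = sum(abs(v) for v in diff.values())
    return (l1 + 1) // 2
```

Whenever this cheap bound already exceeds the similarity limit, the expensive exact computation can be skipped.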
`LevenshteinDistance()` seems to compute the restricted Damerau-Levenshtein distance
- using offsets into the strings of -1 and, for transpositions, -2, which looks unusual
- ignoring the last characters, which looks erroneous (if consequential)
- allocating an array whose size is the product of the lengths involved.
Wagner-Fischer reduces this to two rows for Levenshtein; it should be easy to extend this to one more row for Damerau-Levenshtein.
Reducing this to a single row has been done for Levenshtein. Given an upper bound on the distance (like the length of the longer string), don't compute entries that exceed that bound (Ukkonen's optimisation; en.wikipedia mentions a single-row implementation, but not this one?!).
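A sketch of the two-row variant with a bound-based cut-off, in Python for brevity (plain Levenshtein; the extra transposition row would be a straightforward extension):

```python
def levenshtein_bounded(a, b, bound=None):
    """Wagner-Fischer with two rows; optionally abandon the computation
    once every entry of the current row exceeds `bound`.  (Any optimal
    alignment path passes through each row, and values along it never
    decrease, so the final distance is at least min(current row).)"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # (mis)match
        if bound is not None and min(curr) > bound:
            return bound + 1  # certainly above the bound; exact value not computed
        prev = curr
    return prev[-1]
```

This keeps memory linear in the shorter string and lets a caller abort hopeless comparisons early.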
- Your code follows the mathematical presentation of the measure closely, failing to bail out early:
if the characters match, or the current character is the 2nd of a transposition, the distance does not increase, and no other edit costs less.
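In DP terms: when the diagonal step costs nothing, the cell equals its diagonal predecessor and the `min` over insertions/deletions can be skipped. A Python sketch of the restricted Damerau-Levenshtein with that shortcut (my code, not a rewrite of the reviewed method):

```python
def restricted_damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein with the early bail-out suggested
    above: on a character match, take the diagonal value directly --
    no other edit can be cheaper."""
    m, n = len(a), len(b)
    prev2 = [0] * (n + 1)      # row i-2, needed for transpositions
    prev = list(range(n + 1))  # row i-1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1]  # match costs nothing: bail out
            else:
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    curr[j] = min(curr[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, curr
    return prev[-1]
```

Note the three-row layout (`prev2`, `prev`, `curr`) instead of the full length-product matrix.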