greybeard

(I'm tempted to comment on program development in general, or on getting more out of CodeReview@SE, as opposed to reviewing the code.)

  • Document your code. In the code. Use whatever tool support you can get; for "the C-family", I use and recommend doxygen.
    • Document the external/public interface:
    What is the purpose of PerformFuzzyMatch()? All I can see is the locally declared currentBatchMatchedWords growing.
    What are FuzzyMatchRepo and FuzzyMatchRepository?
    • Comment on why everything non-trivial is there.
  • going parallel seems the way to go; depending on the approach, it may prove advantageous to estimate whether exchanging the roles of list A and list B promises an improvement.
    The fixed group/batch size looks debatable.
  • the fixed similarity limit looks inflexible -
    why not take a function as a parameter?
  • computing the exact distance when you are only interested in "similar enough" is wasteful, as Pieter Witvoet commented
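A caller-supplied predicate is one way to make the cutoff flexible. A minimal Python sketch (all names here are hypothetical, not from the reviewed code):

```python
def fuzzy_pairs(words_a, words_b, distance, is_similar):
    """Yield (a, b, d) for every pair the caller-supplied predicate
    accepts. Both `distance` and `is_similar` are parameters, so the
    matching policy lives with the caller, not inside the matcher."""
    for a in words_a:
        for b in words_b:
            d = distance(a, b)
            if is_similar(a, b, d):
                yield a, b, d

# example policy: allow more edits for longer words,
# instead of one fixed limit for all lengths
def relative_threshold(a, b, d):
    return d <= max(len(a), len(b)) // 4
```

With this shape, swapping "fixed limit" for "length-relative limit" is a one-argument change at the call site.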

To find solutions to evaluate, I find it useful to collect what seems helpful, with an eye on what in the problem at hand differs from well-known problems:

  • sorting the lists by length and contents allows
    • weeding out words occurring more than once
    • establishing bounds on similarity
    • reusing partial results in "dynamic computation"
  • character histograms help establish bounds on distance:
    half the sum of absolute count differences is a lower bound (an insertion or deletion changes one count, a substitution changes two)
  • the simplest part carrying information about relative position is the digram, an (ordered) pair of adjacent characters
  • the triangle inequality holds for the edit distances counting replacements (Hamming), insertions and deletions (Levenshtein), and unrestricted neighbour transpositions (unrestricted Damerau-Levenshtein; I'm still not sure whether there is a generally accepted nomenclature).
    This is related to what "the Wei/Chuan/Xuemin/Chengqi paper" seems to exploit (from glancing over it)
  • "all about a string" is in its suffix array
    Then, there is FM-index (Fulltext-Minute-)
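The histogram bound above can be computed in linear time, making it a cheap filter before any distance computation. A Python sketch; note the halving, since for Levenshtein distance a single substitution already changes two character counts:

```python
from collections import Counter

def histogram_lower_bound(a: str, b: str) -> int:
    """Cheap lower bound on the Levenshtein distance between a and b.
    An insertion or deletion changes one character count, a
    substitution changes two, so at least
    ceil(sum_of_absolute_count_differences / 2) edits are needed."""
    ca, cb = Counter(a), Counter(b)
    diff = sum(abs(ca[c] - cb[c]) for c in ca.keys() | cb.keys())
    return (diff + 1) // 2  # integer ceiling of diff / 2
```

For "kitten" vs "sitting" the bound is 3, which here happens to equal the actual distance; in general it only tells you a pair cannot be closer than the bound, so pairs failing the similarity limit can be skipped without running the full dynamic program.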

  • LevenshteinDistance() seems to compute the restricted Damerau-Levenshtein distance,
    using offsets into the strings of -1 and, for transpositions, -2, which looks unusual,
    and ignoring the last characters, which looks erroneous (if consequential)
  • it allocates an array whose size is the product of the lengths involved.
    Wagner-Fischer reduces this to two rows for Levenshtein; it should be easy to extend this to one more row for Damerau-Levenshtein.
    Reducing this to a single row has been done for Levenshtein
  • given an upper bound on the distance (like the length of the longer string), don't compute entries that exceed that bound (Ukkonen's optimisation; en.wikipedia mentions the single-row implementation, but not this one?!)
  • your code follows the mathematical presentation of the measure closely, missing the chance to bail out early:
    if the characters match, or the current character is the 2nd of a transposition, the distance does not increase, and no other edit costs less
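The two-row layout and the upper-bound cutoff combine naturally. A hedged Python sketch of plain Levenshtein (not the Damerau variant in the reviewed code) that keeps two rows and gives up as soon as every entry of the current row exceeds the cap — valid because values never decrease along any path through the table:

```python
def levenshtein_capped(a: str, b: str, max_dist: int) -> int:
    """Two-row Wagner-Fischer. Returns the Levenshtein distance if it
    is at most max_dist, else max_dist + 1."""
    if abs(len(a) - len(b)) > max_dist:      # length gap alone decides
        return max_dist + 1
    prev = list(range(len(b) + 1))           # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match/substitution
        if min(curr) > max_dist:             # whole row over the cap: give up
            return max_dist + 1
        prev = curr
    d = prev[-1]
    return d if d <= max_dist else max_dist + 1
```

Memory drops from len(a)·len(b) cells to two rows, and hopeless comparisons stop after a few rows instead of filling the whole table.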
