greybeard

(I'm tempted to comment on program development in general, or on getting more out of CodeReview@SE, as opposed to reviewing the code.)

  • Document your code. In the code. Use whatever tool support you can get; for "the C-family", I use and recommend doxygen.
    • Document the external/public interface:
    What is the purpose of PerformFuzzyMatch()? All I can see is the locally declared currentBatchMatchedWords growing.
    What are FuzzyMatchRepo and FuzzyMatchRepository?
    • Comment on why everything non-trivial is there.
  • going parallel seems the way to go; depending on the approach, it may prove advantageous to estimate whether exchanging the roles of list A and list B promises an improvement.
    The fixed group/batch size looks debatable.
  • the fixed similarity limit looks inflexible -
    why not take a function as a parameter?
  • computing the exact distance when you are only interested in "similar enough" is wasteful, as Pieter Witvoet commented
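A caller-supplied predicate is one way to make the cutoff flexible. A minimal Python sketch (all names here are hypothetical, not from the reviewed code):

```python
def fuzzy_pairs(words_a, words_b, distance, is_similar):
    """Yield (a, b, d) for every pair the caller-supplied predicate
    accepts. Both `distance` and `is_similar` are parameters, so the
    matching policy lives with the caller, not inside the matcher."""
    for a in words_a:
        for b in words_b:
            d = distance(a, b)
            if is_similar(a, b, d):
                yield a, b, d

# example policy: allow more edits for longer words,
# instead of one fixed limit for all lengths
def relative_threshold(a, b, d):
    return d <= max(len(a), len(b)) // 4
```

With this shape, swapping "fixed limit" for "length-relative limit" is a one-argument change at the call site.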

To find solutions to evaluate, I find it useful to collect what seems helpful, with an eye on what in the problem at hand differs from well-known problems:

  • sorting the lists by length and contents allows
    • weeding out words occurring more than once
    • establishing bounds on similarity
    • reusing partial results in "dynamic computation"
  • character histograms help establish bounds on distance:
    half the sum of absolute count differences is a lower bound (an insertion or deletion changes one count, a substitution changes two)
  • the simplest part carrying information about relative position is the digram, an (ordered) pair of adjacent characters
  • the triangle inequality holds for the edit distances counting replacements (Hamming), insertions and deletions (Levenshtein), and unrestricted neighbour transpositions (unrestricted Damerau-Levenshtein; I'm still not sure whether there is a generally accepted nomenclature).
    This is related to what "the Wei/Chuan/Xuemin/Chengqi paper" seems to exploit (from glancing over it)
  • "all about a string" is in its suffix array
    Then, there is FM-index (Fulltext-Minute-)
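The histogram bound above can be computed in linear time, making it a cheap filter before any distance computation. A Python sketch; note the halving, since for Levenshtein distance a single substitution already changes two character counts:

```python
from collections import Counter

def histogram_lower_bound(a: str, b: str) -> int:
    """Cheap lower bound on the Levenshtein distance between a and b.
    An insertion or deletion changes one character count, a
    substitution changes two, so at least
    ceil(sum_of_absolute_count_differences / 2) edits are needed."""
    ca, cb = Counter(a), Counter(b)
    diff = sum(abs(ca[c] - cb[c]) for c in ca.keys() | cb.keys())
    return (diff + 1) // 2  # integer ceiling of diff / 2
```

For "kitten" vs "sitting" the bound is 3, which here happens to equal the actual distance; in general it only tells you a pair cannot be closer than the bound, so pairs failing the similarity limit can be skipped without running the full dynamic program.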

  • LevenshteinDistance() seems to compute the restricted Damerau-Levenshtein distance,
    using offsets into the strings of -1 and, for transpositions, -2, which looks unusual,
    and ignoring the last characters, which looks erroneous (if consequential)
  • it allocates an array whose size is the product of the lengths involved.
    Wagner-Fischer reduces this to two rows for Levenshtein; it should be easy to extend this to one more row for Damerau-Levenshtein.
    Reducing this to a single row has been done for Levenshtein
  • given an upper bound on the distance (like the length of the longer string), don't compute entries that exceed that bound (Ukkonen's optimisation; en.wikipedia mentions the single-row implementation, but not this one?!)
  • your code follows the mathematical presentation of the measure closely, missing the chance to bail out early:
    if the characters match, or the current character is the 2nd of a transposition, the distance does not increase, and no other edit costs less
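The two-row layout and the upper-bound cutoff combine naturally. A hedged Python sketch of plain Levenshtein (not the Damerau variant in the reviewed code) that keeps two rows and gives up as soon as every entry of the current row exceeds the cap — valid because values never decrease along any path through the table:

```python
def levenshtein_capped(a: str, b: str, max_dist: int) -> int:
    """Two-row Wagner-Fischer. Returns the Levenshtein distance if it
    is at most max_dist, else max_dist + 1."""
    if abs(len(a) - len(b)) > max_dist:      # length gap alone decides
        return max_dist + 1
    prev = list(range(len(b) + 1))           # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match/substitution
        if min(curr) > max_dist:             # whole row over the cap: give up
            return max_dist + 1
        prev = curr
    d = prev[-1]
    return d if d <= max_dist else max_dist + 1
```

Memory drops from len(a)·len(b) cells to two rows, and hopeless comparisons stop after a few rows instead of filling the whole table.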
