Approximate string comparators TvungenOne, 2012-06-15 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga 1
Approximate string comparators? • Basically, measures of the similarity between two strings • Useful in situations where exact match is insufficient – record linkage – search – ... • Many of these are slow: O(n2) 2
Levenshtein • Also known as edit distance • Measures the number of edit operations necessary to turn s1 into s2 • Edit operations are – insert a character – remove a character – substitute a character 3
Levenshtein example • Levenshtein -> Löwenstein – Levenstein (remove „h‟) – Lövenstein (substitute „ö‟) – Löwenstein (substitute „w‟) • Edit distance = 3 4
Weighted Levenshtein • Not all edit operations are equal • Substituting “i” for “e” is a smaller edit than substituting “o” for “k” • Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0 • Difficult to implement – weights are also language-dependent 5
Jaro-Winkler • Developed at the US Bureau of the Census • For name comparisons – not well suited to long strings – best if given name/surname are separated • Exists in a few variants – originally proposed by Winkler – then modified by Jaro – a few different versions of modifications etc 6
Jaro-Winkler definition • Formula: – m = number of matching characters – t = number of transposed characters • A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away • Levenshtein ~ Löwenstein = 0.8 • Axel ~ Aksel = 0.783 7
Jaro-Winkler variant 8
Soundex • A coarse schema for matching names by sound – produces a key from the name – names match if key is the same • In common use in many places – Nav‟s person register uses it for search – built-in in many databases – ... 9
Soundex definition 10
Examples • soundex(“Axel”) = „A240‟ • soundex(“Aksel”) = „A240‟ • soundex(“Levenshtein”) = „L523‟ • soundex(“Löwenstein”) = „L152‟ 11
Metaphone • Developed by Lawrence Philips • Similar to Soundex, but much more complex – both more accurate and more sensitive • Developed further into Double Metaphone • Metaphone 3.0 also exists, but only available commercially 12
Metaphone examples • metaphone(“Axel”) = „AKSL‟ • metaphone(“Aksel”) = „AKSL‟ • metaphone(“Levenshtein”) = „LFNX‟ • metaphone(“Löwenstein”) = „LWNS‟ 13
Dice coefficient • A similarity measure for sets – set can be tokens in a string – or characters in a string • Formula: 14
TFIDF • Compares strings as sets of tokens – a la Dice coefficient • However, takes frequency of tokens in corpus into account – this matches how we evaluate matches mentally • Has done well in evaluations – however, can be difficult to evaluate – results will change as corpus changes 15
More comparators • Smith-Waterman – originated in DNA sequencing • Q-grams distance – breaks string into sets of pieces of q characters – then does set similarity comparison • Monge-Elkan – similar to Smith-Waterman, but with affine gap distances – has done very well in evaluations – costly to evaluate • Many, many more – ... 16

Approximate string comparators

  • 1.
    Approximate string comparators TvungenOne,2012-06-15 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga 1
  • 2.
    Approximate string comparators? •Basically, measures of the similarity between two strings • Useful in situations where exact match is insufficient – record linkage – search – ... • Many of these are slow: O(n2) 2
  • 3.
    Levenshtein • Also knownas edit distance • Measures the number of edit operations necessary to turn s1 into s2 • Edit operations are – insert a character – remove a character – substitute a character 3
  • 4.
    Levenshtein example • Levenshtein-> Löwenstein – Levenstein (remove „h‟) – Lövenstein (substitute „ö‟) – Löwenstein (substitute „w‟) • Edit distance = 3 4
  • 5.
    Weighted Levenshtein • Notall edit operations are equal • Substituting “i” for “e” is a smaller edit than substituting “o” for “k” • Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0 • Difficult to implement – weights are also language-dependent 5
  • 6.
    Jaro-Winkler • Developed atthe US Bureau of the Census • For name comparisons – not well suited to long strings – best if given name/surname are separated • Exists in a few variants – originally proposed by Winkler – then modified by Jaro – a few different versions of modifications etc 6
  • 7.
    Jaro-Winkler definition • Formula: – m = number of matching characters – t = number of transposed characters • A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away • Levenshtein ~ Löwenstein = 0.8 • Axel ~ Aksel = 0.783 7
  • 8.
  • 9.
    Soundex • A coarseschema for matching names by sound – produces a key from the name – names match if key is the same • In common use in many places – Nav‟s person register uses it for search – built-in in many databases – ... 9
  • 10.
  • 11.
    Examples • soundex(“Axel”) = „A240‟ • soundex(“Aksel”) = „A240‟ • soundex(“Levenshtein”) = „L523‟ • soundex(“Löwenstein”) = „L152‟ 11
  • 12.
    Metaphone • Developed byLawrence Philips • Similar to Soundex, but much more complex – both more accurate and more sensitive • Developed further into Double Metaphone • Metaphone 3.0 also exists, but only available commercially 12
  • 13.
    Metaphone examples • metaphone(“Axel”) = „AKSL‟ • metaphone(“Aksel”) = „AKSL‟ • metaphone(“Levenshtein”) = „LFNX‟ • metaphone(“Löwenstein”) = „LWNS‟ 13
  • 14.
    Dice coefficient • Asimilarity measure for sets – set can be tokens in a string – or characters in a string • Formula: 14
  • 15.
    TFIDF • Compares stringsas sets of tokens – a la Dice coefficient • However, takes frequency of tokens in corpus into account – this matches how we evaluate matches mentally • Has done well in evaluations – however, can be difficult to evaluate – results will change as corpus changes 15
  • 16.
    More comparators • Smith-Waterman – originated in DNA sequencing • Q-grams distance – breaks string into sets of pieces of q characters – then does set similarity comparison • Monge-Elkan – similar to Smith-Waterman, but with affine gap distances – has done very well in evaluations – costly to evaluate • Many, many more – ... 16