Approximate string comparators

Approximate string comparators TvungenOne, 2012-06-15 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga 1

Approximate string comparators? • Basically, measures of the similarity between two strings • Useful in situations where exact match is insufficient – record linkage – search – ... • Many of these are slow: O(n2) 2

Levenshtein • Also known as edit distance • Measures the number of edit operations necessary to turn s1 into s2 • Edit operations are – insert a character – remove a character – substitute a character 3

Levenshtein example • Levenshtein -> Löwenstein – Levenstein (remove „h‟) – Lövenstein (substitute „ö‟) – Löwenstein (substitute „w‟) • Edit distance = 3 4

Weighted Levenshtein • Not all edit operations are equal • Substituting “i” for “e” is a smaller edit than substituting “o” for “k” • Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0 • Difficult to implement – weights are also language-dependent 5

Jaro-Winkler • Developed at the US Bureau of the Census • For name comparisons – not well suited to long strings – best if given name/surname are separated • Exists in a few variants – originally proposed by Winkler – then modified by Jaro – a few different versions of modifications etc 6

Jaro-Winkler definition • Formula: – m = number of matching characters – t = number of transposed characters • A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away • Levenshtein ~ Löwenstein = 0.8 • Axel ~ Aksel = 0.783 7

Soundex • A coarse schema for matching names by sound – produces a key from the name – names match if key is the same • In common use in many places – Nav‟s person register uses it for search – built-in in many databases – ... 9

Examples • soundex(“Axel”) = „A240‟ • soundex(“Aksel”) = „A240‟ • soundex(“Levenshtein”) = „L523‟ • soundex(“Löwenstein”) = „L152‟ 11

Metaphone • Developed by Lawrence Philips • Similar to Soundex, but much more complex – both more accurate and more sensitive • Developed further into Double Metaphone • Metaphone 3.0 also exists, but only available commercially 12

Metaphone examples • metaphone(“Axel”) = „AKSL‟ • metaphone(“Aksel”) = „AKSL‟ • metaphone(“Levenshtein”) = „LFNX‟ • metaphone(“Löwenstein”) = „LWNS‟ 13

Dice coefficient • A similarity measure for sets – set can be tokens in a string – or characters in a string • Formula: 14

TFIDF • Compares strings as sets of tokens – a la Dice coefficient • However, takes frequency of tokens in corpus into account – this matches how we evaluate matches mentally • Has done well in evaluations – however, can be difficult to evaluate – results will change as corpus changes 15

More comparators • Smith-Waterman – originated in DNA sequencing • Q-grams distance – breaks string into sets of pieces of q characters – then does set similarity comparison • Monge-Elkan – similar to Smith-Waterman, but with affine gap distances – has done very well in evaluations – costly to evaluate • Many, many more – ... 16

Approximate string comparators

More Related Content

Similar to Approximate string comparators

More from Lars Marius Garshol

Recently uploaded

Approximate string comparators