Stéphanie C

How to do fuzzy matching of postal addresses?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.

So far I have found several approaches, but I think they are quite old and not very efficient. I am sure better methods exist, so references to read would be welcome; I expect this subject is of interest to several people.

The solutions I have found so far (examples are in R):

  • Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.

    agrep("acusait", c("accusait", "abusait"), max.distance = 2, value = TRUE)
    ## [1] "accusait" "abusait"

  • The comparison of phonemes

    library(RecordLinkage)
    soundex(x <- c('accusait', 'acusait', 'abusait'))
    ## [1] "A223" "A223" "A123"

  • The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it is not very effective on addresses.

  • I thought about using the suggestions from Google Suggest, but likewise, it is not very effective on personal postal addresses.

  • One could imagine a supervised machine learning approach, but that requires having stored users' misspelled requests, which is not an option for me.
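
To make the Levenshtein idea concrete on whole addresses rather than single words, here is a minimal sketch in base R. It is only an illustration under stated assumptions: `normalize_address` is a hypothetical helper, its abbreviation list ("st", "ave") is an arbitrary English-language example and would wrongly expand "St" meaning "Saint", and the sample addresses are made up. `adist` is the base R function for generalized Levenshtein distance.

```r
# Hypothetical helper: crude normalisation so that format differences
# (case, punctuation, common abbreviations) do not count as edits.
normalize_address <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)     # drop punctuation
  x <- gsub("\\bst\\b", "street", x, perl = TRUE)   # assumed abbreviation list
  x <- gsub("\\bave\\b", "avenue", x, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x, perl = TRUE)    # collapse repeated spaces
  trimws(x)
}

# Made-up data: one misspelled, abbreviated query and three reference addresses.
query      <- "12 Mainn St."
candidates <- c("12 Main Street", "21 Main Street", "12 Maine Avenue")

# Edit distance between the normalised query and each normalised candidate
# (a 1 x 3 matrix; smaller means closer).
d <- adist(normalize_address(query), normalize_address(candidates))

# Pick the closest candidate.
best <- candidates[which.min(d)]
best
## [1] "12 Main Street"
```

Normalising first means only the genuine typo ("Mainn") contributes to the distance; in practice one would probably also reject matches whose distance exceeds some threshold, which is left out here.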