How to do fuzzy matching of postal addresses?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.

So far I've found several solutions, but I think they are quite old and not very effective. I'm sure better methods exist, so if you have references for me to read, I would appreciate them; I expect the subject will interest other people too.

The solutions I have found so far (examples are in R):

Levenshtein distance, which is the minimum number of characters you have to insert, delete or substitute to transform one word into another.

agrep("acusait", c("accusait", "abusait"), max.distance = 2, value = TRUE)
## [1] "accusait" "abusait"
The comparison of phonemes

library(RecordLinkage)
soundex(x <- c('accusait', 'acusait', 'abusait'))
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), though I guess it would not be very effective on addresses.
I thought about using the suggestions of Google Suggest, but likewise, they are not very effective on personal postal addresses.
One could imagine a supervised machine learning approach, but that requires having stored the misspelled requests of users, which is not an option for me.
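
To make the Levenshtein idea concrete on whole addresses rather than single words, here is a minimal sketch in base R. It normalises each address first (lower case, punctuation removed, a couple of abbreviations expanded) and then picks the reference address with the smallest edit distance using `adist`. The helper names `normalise_address` and `match_address`, the abbreviation rules and the sample addresses are all assumptions made up for illustration, not an established method.

```r
# Hypothetical helper: put an address into a comparable form.
normalise_address <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)     # drop commas, dots, etc.
  x <- gsub("\\bst\\b", "street", x, perl = TRUE)   # illustrative abbreviations only
  x <- gsub("\\bave\\b", "avenue", x, perl = TRUE)
  x <- gsub("\\s+", " ", x, perl = TRUE)            # collapse repeated spaces
  trimws(x)
}

# Hypothetical helper: for each query, return the closest reference address
# according to the Levenshtein distance computed by adist().
match_address <- function(query, reference) {
  d <- adist(normalise_address(query), normalise_address(reference))
  best <- apply(d, 1, which.min)
  data.frame(query = query,
             match = reference[best],
             distance = d[cbind(seq_along(query), best)],
             stringsAsFactors = FALSE)
}

# Made-up sample data.
reference <- c("12 Main Street, Springfield", "45 Oak Avenue, Shelbyville")
query <- c("12 main st. springfeild", "45, Oak Ave Shelbyville")
result <- match_address(query, reference)
print(result)
```

This only handles small formatting differences and typos; a real abbreviation list would be much longer, and a threshold on `distance` would be needed to reject queries that have no good match.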

text-mining
