C++ edit distance / string similarity function based on the Jaro-Winkler algorithm

Question

I wrote a short library function, based on an example from Rosetta, to compare two strings and determine similarity, using Jaro-Winkler.

A short copy-paste ready example:

main.cpp

    #include <string>
    #include <iostream>
    #include "jaro_winkler.hpp"

    int main() {
        std::string a { "DWAYNE" };
        std::string b { "DUANE" };
        std::cout << "Similarity for '" << a << "' and '" << b
                  << "': " << edit_distance::jaro_winkler(a, b) << std::endl;
        std::cout << "Similarity for 'MARTHA' and 'MARHTA': "
                  << edit_distance::jaro_winkler("MARTHA", "MARHTA") << std::endl;
        std::cout << "Similarity for 'DIXON' and 'DICKSONX': "
                  << edit_distance::jaro_winkler("DIXON", "DICKSONX") << std::endl;
        std::cout << "Similarity for 'JELLYFISH' and 'SMELLYFISH': "
                  << edit_distance::jaro_winkler("JELLYFISH", "SMELLYFISH") << std::endl;
    }

jaro_winkler.hpp

    #pragma once

    #include <cstddef>
    #include <cstdint>
    #include <algorithm>
    #include <string_view>

    namespace edit_distance {

    template <typename T = float>
    inline T jaro(const std::string_view source,
                  const std::string_view target) {
        const unsigned sl = source.length();
        const unsigned tl = target.length();

        if (sl == 0 || tl == 0) return 0;

        const auto match_distance = (sl == 1 && tl == 1)
                                    ? 0
                                    : (std::max(sl, tl) / 2 - 1);

        auto source_matches = new bool[sl] {0};
        auto target_matches = new bool[tl] {0};
        unsigned matches = 0;

        for (unsigned i = 0; i < sl; ++i) {
            const auto end = std::min(i + match_distance + 1, tl);
            for (auto k = i > match_distance ? (i - match_distance) : 0u; k < end; ++k) {
                if (!target_matches[k] && source[i] == target[k]) {
                    source_matches[i] = true;
                    target_matches[k] = true;
                    ++matches;
                    break;
                }
            }
        }

        if (matches == 0) {
            delete[] source_matches;
            delete[] target_matches;
            return 0;
        }

        T t = 0.0;
        unsigned k = 0;
        for (unsigned i = 0; i < sl; ++i) {
            if (source_matches[i]) {
                while (!target_matches[k]) ++k;
                if (source[i] != target[k]) t += 0.5;
                ++k;
            }
        }

        const T m = matches;
        delete[] source_matches;
        delete[] target_matches;
        return (m / sl + m / tl + (m - t) / m) / 3.0;
    }

    template <typename T = float>
    inline T jaro_winkler(const std::string_view source,
                          const std::string_view target,
                          const unsigned prefix = 2,
                          const T boost_treshold = 0.7,
                          const T scaling_factor = 0.1) {
        const auto similarity = jaro<T>(source, target);

        if (similarity > boost_treshold) {
            int common_prefix = 0;
            const int l = std::min((unsigned)std::min(source.length(), target.length()), prefix);
            for (; common_prefix < l; ++common_prefix) {
                if (source[common_prefix] != target[common_prefix]) break;
            }
            return similarity + scaling_factor * common_prefix * (1 - similarity);
        } else {
            return similarity;
        }
    }

    } // namespace edit_distance

I'd be happy to hear any comments or critiques.

Arnav Borborah · Accepted Answer · 2018-02-21 11:41:13Z

Here are a few things that I immediately see in this code that have room for improvement:

template <typename T = float>

Assuming you only want floating-point types passed as the template parameter (since you do floating-point arithmetic later on), you could add:

static_assert(std::is_floating_point<T>::value, "jaro can only be used with floating point types");
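To show where such a check would sit, here is a minimal sketch on a stand-in function. The name half_of is hypothetical and is not part of the reviewed header; only the placement of the static_assert is the point.

```cpp
#include <type_traits>

// Hypothetical stand-in for jaro(): the body is irrelevant, the check is what matters.
template <typename T = float>
T half_of(T value)
{
    // Rejects e.g. half_of<int>(3) at compile time with a readable message.
    static_assert(std::is_floating_point<T>::value,
                  "half_of can only be used with floating point types");
    return value / 2;
}
```

Calling half_of(3.0) compiles and yields 1.5, while half_of(3) stops compilation with the message given above.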

    const unsigned sl = source.length();
    const unsigned tl = target.length();

While the above may work for your code, the type for the length of the std::string_view isn't always guaranteed to be a plain unsigned int. Consider using std::size_t as the variable type, or std::string_view::size_type.

    auto source_matches = new bool[sl] {0};
    auto target_matches = new bool[tl] {0};

The use of auto is confusing here. It is hard to tell at first glance that you mean to use a dynamic array. auto doesn't shorten the type much either, so you may want to remove it.
Raw dynamic arrays are a large source of error. Consider using std::vector instead. This will also allow you to remove the deletes later on in your code.

for (unsigned i = 0; i < sl; ++i)

You could use std::size_t here. Don't make assumptions about unsigned types.

    delete[] source_matches;
    delete[] target_matches;

Using a std::vector would allow you to remove this.
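As a rough sketch of what that could look like, the matching loop is reproduced below with a std::vector<char> holding the flags (char rather than bool, to sidestep the packed std::vector<bool> specialization). The helper name count_matches is hypothetical, and only the target flags are kept to stay short. No claim is made here about relative performance.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: counts Jaro matches using the same window logic as the
// reviewed code, but with a vector that cleans up after itself.
inline std::size_t count_matches(std::string_view source, std::string_view target)
{
    const std::size_t sl = source.length();
    const std::size_t tl = target.length();
    if (sl == 0 || tl == 0)
        return 0;  // early return, nothing to delete

    const std::size_t longest = std::max(sl, tl);
    const std::size_t match_distance = longest < 2 ? 0 : longest / 2 - 1;

    std::vector<char> target_matches(tl, 0);  // one flag per target character
    std::size_t matches = 0;

    for (std::size_t i = 0; i < sl; ++i) {
        const std::size_t end = std::min(i + match_distance + 1, tl);
        const std::size_t start = i > match_distance ? i - match_distance : 0;
        for (std::size_t k = start; k < end; ++k) {
            if (!target_matches[k] && source[i] == target[k]) {
                target_matches[k] = 1;
                ++matches;
                break;
            }
        }
    }
    return matches;  // the vector is released automatically on every path
}
```

Every return path is now free of cleanup code, which is the main thing the suggestion buys.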

(unsigned)std::min(source.length(), target.length())

If you have to use a cast, prefer using static_cast, as it makes your intent clearer.
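A small sketch of that line with a named cast, under the assumption that prefix stays unsigned as in the question. The helper name prefix_limit is hypothetical. Taking the minimum in std::size_t first means the narrowing cast can never drop bits, because the result is at most prefix.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper mirroring the prefix-length computation in jaro_winkler().
inline unsigned prefix_limit(std::string_view source,
                             std::string_view target,
                             unsigned prefix)
{
    const std::size_t shortest = std::min(source.length(), target.length());
    // Compare in std::size_t, then narrow: the value is <= prefix, so it fits.
    return static_cast<unsigned>(std::min<std::size_t>(shortest, prefix));
}
```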

Thanks for your review! You are entirely correct that floating-point types should be enforced (through static_assert or enable_if). I think you missed the point in your second remark: the code specifies unsigned specifically to store the size in an unsigned. If I used auto as you suggest, the type would be whatever the platform dictates, which breaks further use of the variables (std::min doesn't allow mixing type sizes or signedness), leading to fragile code. About your third remark: the deduction for auto is on the same line and reduces duplication of information; it's a style thing, I guess.
– Stefan, Commented Feb 21, 2018 at 7:58
About auto for i: same remark as before. The whole goal is to be explicit, and auto will deduce 0 as int. You say don't assume, but that's exactly what you're suggesting as a replacement. If you wanted to take it further you could go for decltype(sl), which would be more correct and far less clear. About using a std::vector: this is a matter of preference. I checked, and a vector is very slow by comparison. This is performance-sensitive code and the scope is very clear, hence the choice of manual memory management in this specific case.
– Stefan, Commented Feb 21, 2018 at 8:04
If you use auto i, you'll want to change its initializer to 0u. But as Stefan says, you really want it to match sl.
– Toby Speight, Commented Feb 21, 2018 at 9:48
@Stefan, how did you compare the speed of std::vector against new bool[]? And were you using std::vector<bool>, which has known problems? (It's more compact than your array, at a cost of performance and substitutability.) A closer equivalent would be std::vector<unsigned char>. If you really want the array, I'd recommend std::make_unique<bool[]>(sl) as a safer alternative to the raw pointer.
– Toby Speight, Commented Feb 21, 2018 at 9:51
I ran a range of types and compile flags (O0, O2, O3, Ofast) just to reconfirm earlier experiences, from bool to vectors of std::string and unsigned (close to a pointer in size). I tested reserve(n) and allocation inside and outside of loops. The closest I could get with a vector is half the performance of a plain heap array. I took unsigned as the general case, as it indeed seemed to have the fewest issues. I'm aware that more complex functions will dwarf the difference. Interestingly, std::array was slower as well, even though it shouldn't have been. std::make_unique is a good suggestion.
– Stefan, Commented Feb 21, 2018 at 10:06

Toby Speight · Accepted Answer · 2018-02-21 13:21:52Z

Naming

Because the namespace is called edit_distance, it's easy to assume that jaro() means Jaro distance, but it actually computes the Jaro similarity. Similarly for jaro_winkler(). It may be worth adding a suffix to disambiguate them.

Include your own headers first

The test program includes <string> and <iostream> before "jaro_winkler.hpp". Generally, I prefer the opposite order, to avoid masking forgotten includes in my own headers.

Refactor the tests

There's a lot of repetition in the tests, which can be reduced:

    static void show_distance(std::string_view a, std::string_view b)
    {
        std::cout << "Similarity for '" << a << "' and '" << b
                  << "': " << edit_distance::jaro_winkler_similarity(a, b)
                  << std::endl;
    }

    int main()
    {
        show_distance(std::string{"DWAYNE"}, "DUANE");
        show_distance("MARTHA", "MARHTA");
        show_distance("DIXON", "DICKSONX");
        show_distance("JELLYFISH", "SMELLYFISH");
    }

Default to `double`

In C and C++, double is the default floating-point type; float and long double have to be specifically requested. It's best to stay consistent with this, so I would write template <typename T = double>.

You can use SFINAE to prevent non-floating-point types from being seen:

    template <typename T = double>
    inline std::enable_if_t<std::is_floating_point_v<T>, T>
    jaro_similarity(const std::string_view source,
                    const std::string_view target)
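To illustrate the effect, here is a hedged sketch on a stand-in function. The names halve and can_halve are hypothetical; the detection trait exists only to show that the template drops out of overload resolution for non-floating-point types instead of producing a hard error inside the body.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for jaro_similarity(): same return-type constraint.
template <typename T = double>
std::enable_if_t<std::is_floating_point_v<T>, T> halve(T value)
{
    return value / 2;
}

// Hypothetical detection helper: true when halve<T>(T) is a viable call.
template <typename T, typename = void>
struct can_halve : std::false_type {};

template <typename T>
struct can_halve<T, std::void_t<decltype(halve<T>(std::declval<T>()))>>
    : std::true_type {};

static_assert(can_halve<double>::value, "floating-point types are accepted");
static_assert(!can_halve<int>::value, "integer types are silently removed");
```

Compared with a static_assert in the body, the trade-off is a less specific diagnostic at the call site in exchange for the function being invisible to overload resolution.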

What if both strings are empty?

I'd argue that two equal strings must always have a similarity of one, even if they are both empty. So I'd re-write that test:

    if (source == target) return 1;
    if (source.empty() || target.empty()) return 0;

I think source.empty() indicates the intent slightly better than sl == 0, but that might be just me.
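The two guards can be exercised in isolation. The sketch below wraps them in a hypothetical helper, trivial_similarity, which returns a value only when the answer is known without running the matching loops; the name and the use of std::optional are assumptions made for illustration.

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper: the early-exit checks from the review, nothing more.
template <typename T = double>
std::optional<T> trivial_similarity(std::string_view source, std::string_view target)
{
    if (source == target) return T(1);                   // includes two empty strings
    if (source.empty() || target.empty()) return T(0);   // exactly one is empty
    return std::nullopt;                                 // the full algorithm is needed
}
```

The order of the two checks matters: testing for emptiness first would report 0 for two empty strings.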

Simplify a test

 const auto match_distance = (sl == 1 && tl == 1) ? 0 : (std::max(sl, tl) / 2 - 1);

Since we're using std::max(sl, tl) in one branch (and since std::max() is declared constexpr), it may be cheaper/clearer to use it for the test, too:

 const auto match_distance = std::max(sl, tl) < 2 ? 0 : std::max(sl, tl) / 2 - 1;
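A quick compile-time check that the two forms agree, as a sketch. Both function names are hypothetical, and both assume non-empty inputs, since empty strings are rejected before this point in the reviewed code.

```cpp
#include <algorithm>
#include <cstddef>

// The expression from the question, wrapped for testing.
constexpr std::size_t match_distance_original(std::size_t sl, std::size_t tl)
{
    return (sl == 1 && tl == 1) ? 0 : (std::max(sl, tl) / 2 - 1);
}

// The simplified expression suggested above.
constexpr std::size_t match_distance_simplified(std::size_t sl, std::size_t tl)
{
    return std::max(sl, tl) < 2 ? 0 : std::max(sl, tl) / 2 - 1;
}

static_assert(match_distance_original(1, 1) == 0 &&
              match_distance_simplified(1, 1) == 0, "single characters");
static_assert(match_distance_original(3, 3) == 0 &&
              match_distance_simplified(3, 3) == 0, "JON / JAN");
static_assert(match_distance_original(6, 5) == 2 &&
              match_distance_simplified(6, 5) == 2, "DWAYNE / DUANE");
```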

Avoid raw pointers

Instead of creating new bool[], we normally prefer objects that C++ will clean up for us. The usual approach is to create a std::vector, but when the type is bool, that selects the (non-Standard-Container) specialization, which we don't want, for reasons of speed. We could use a std::vector<char> (or <unsigned char>), or we could stick with a bool[] but ask a smart pointer to manage it:

    auto source_matches = std::make_unique<bool[]>(sl);
    auto target_matches = std::make_unique<bool[]>(tl);

Prefer `std::size_t` over `unsigned int` for indexes

The standard size type is guaranteed to have enough range for even the longest array.

Count using an integer type

Instead of accumulating 0.5 at a time, we can accumulate by 1 at a time, and divide at the end:

    std::size_t t = 0;
    std::size_t k = 0;
    for (std::size_t i = 0; i < sl; ++i) {
        if (source_matches[i]) {
            while (!target_matches[k]) ++k;
            if (source[i] != target[k]) ++t;
            ++k;
        }
    }

    const T m = matches;
    return (m/sl + m/tl + 1 - t/m/2) / 3.0;
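The rewritten return expression follows from the original one by simple algebra. As a sketch, write m for the number of matches and t for the integer count of matched positions whose characters differ, so that the original loop accumulated t/2:

```latex
\frac{m - t/2}{m} = 1 - \frac{t}{2m}
\qquad\Longrightarrow\qquad
\mathrm{sim}_J = \frac{1}{3}\left(\frac{m}{sl} + \frac{m}{tl} + 1 - \frac{t}{2m}\right)
```

For MARTHA and MARHTA, m = 6, t = 2 and sl = tl = 6, which gives (1 + 1 + 1 - 2/12) / 3, roughly 0.944.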

A tiny typo

I'm guessing boost_treshold should be boost_threshold.

Simplify the nested `min()` calls

If we change the type of prefix to match source.length() and target.length(), we can use the version of std::min() that accepts an initializer list:

 const auto l = std::min({source.length(), target.length(), prefix});
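As a self-contained sketch of that call: the helper name common_prefix_limit is hypothetical, and the default of 2 is taken from the question. All three values must have the same type for the initializer-list overload to deduce, which is why prefix becomes std::size_t.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: the longest prefix worth comparing.
inline std::size_t common_prefix_limit(std::string_view source,
                                       std::string_view target,
                                       std::size_t prefix = 2)
{
    // One call replaces the nested std::min and the cast.
    return std::min({source.length(), target.length(), prefix});
}
```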

My version

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <string_view>
    #include <type_traits>

    namespace edit_distance {

    template <typename T = double>
    inline std::enable_if_t<std::is_floating_point_v<T>, T>
    jaro_similarity(const std::string_view source,
                    const std::string_view target)
    {
        if (source == target)
            return 1;
        if (source.empty() || target.empty())
            return 0;

        const auto sl = source.length();
        const auto tl = target.length();

        const auto match_distance = std::max(sl, tl) < 2 ? 0 : std::max(sl, tl) / 2 - 1;

        auto source_matches = std::make_unique<bool[]>(sl);
        auto target_matches = std::make_unique<bool[]>(tl);
        std::size_t matches = 0;

        for (std::size_t i = 0; i < sl; ++i) {
            const auto end = std::min(i + match_distance + 1, tl);
            const auto start = i > match_distance ? (i - match_distance) : 0u;
            for (auto k = start; k < end; ++k) {
                if (!target_matches[k] && source[i] == target[k]) {
                    target_matches[k] = source_matches[i] = true;
                    ++matches;
                    break;
                }
            }
        }

        if (matches == 0) {
            return 0;
        }

        std::size_t t = 0;
        for (std::size_t i = 0, k = 0; i < sl; ++i) {
            if (source_matches[i]) {
                while (!target_matches[k]) ++k;
                if (source[i] != target[k]) ++t;
                ++k;
            }
        }

        const T m = matches;
        return (m/sl + m/tl + 1 - t/m/2) / 3.0;
    }

    template <typename T = double>
    inline T jaro_winkler_similarity(const std::string_view source,
                                     const std::string_view target,
                                     const std::size_t prefix = 2,
                                     const T boost_treshold = 0.7,
                                     const T scaling_factor = 0.1)
    {
        const auto similarity = jaro_similarity<T>(source, target);

        if (similarity > boost_treshold) {
            const auto l = std::min({source.length(), target.length(), prefix});
            std::size_t common_prefix = 0;
            for (; common_prefix < l; ++common_prefix) {
                if (source[common_prefix] != target[common_prefix]) break;
            }
            return similarity + scaling_factor * common_prefix * (1 - similarity);
        } else {
            return similarity;
        }
    }

    } // namespace edit_distance

    // Test program
    #include <string>
    #include <iostream>

    static void show_distance(std::string_view a, std::string_view b)
    {
        std::cout << "Similarity for '" << a << "' and '" << b
                  << "': " << edit_distance::jaro_winkler_similarity(a, b)
                  << std::endl;
    }

    int main()
    {
        show_distance(std::string{"DWAYNE"}, "DUANE");
        show_distance("MARTHA", "MARHTA");
        show_distance("DIXON", "DICKSONX");
        show_distance("JELLYFISH", "SMELLYFISH");
    }

About the redundant test bit: Can you take a look at this broken-out example? godbolt.org/g/9V7RsK
– Stefan, Commented Feb 21, 2018 at 11:14

Edward · Accepted Answer · 2018-02-21 16:27:57Z

The other reviews have hit most of the points I'd have made, so this is a complementary review that looks primarily at the actual algorithm and testing. I went to the source, a 2006 paper by Winkler, and extracted test vectors from a table on page 12.

I wanted to automate the testing in the simplest way possible (because I'm lazy!) and so I used this test harness.

main.cpp

    #include "test.h"
    #include "jaro_winkler.hpp"
    #include <string_view>
    #include <iostream>

    float jw(const std::string_view a, const std::string_view b) {
        return edit_distance::jaro_winkler(a, b, 4, 0.78, 0.1);
    }

    int main() {
        bool passed{true};
        for (const auto &t: tests) {
            passed &= t(jw);
        }
        std::cout << "\nAll tests " << (passed ? "passed" : "did NOT pass!") << '\n';
    }

test.h

    #ifndef TEST_H
    #define TEST_H
    #include <string>
    #include <string_view>
    #include <vector>

    class Test {
    public:
        Test(const std::string &a, const std::string &b, float expected);
        bool operator()(float (&func)(const std::string_view a, const std::string_view b)) const;
    private:
        std::string a, b;
        float expected;
    };

    extern const std::vector<Test> tests;
    #endif // TEST_H

test.cpp

    #include "test.h"
    #include <vector>
    #include <cmath>
    #include <string>
    #include <iostream>
    #include <iomanip>

    Test::Test(const std::string &a, const std::string &b, float expected) :
        a{a}, b{b}, expected{expected}
    {
    }

    bool Test::operator()(float (&func)(const std::string_view a, const std::string_view b)) const {
        constexpr float epsilon{0.0005};
        const auto dist{func(a, b)};
        const auto delta{std::abs(dist - expected)};
        const auto result{epsilon > delta};
        std::cout << std::setw(7) << std::boolalpha << result
            << std::setw(15) << a
            << std::setw(15) << b
            << '\t' << std::fixed << std::setprecision(3) << dist
            << '\t' << std::setprecision(4) << delta
            << '\n';
        return result;
    }

    /*
     * These sample values are from
     * "Overview of Record Linkage and Current Research Directions"
     * Winkler, W. (2006)
     * https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf
     */
    const std::vector<Test> tests{
        { "SHACKLEFORD", "SHACKELFORD", 0.982 },
        { "DUNNINGHAM", "CUNNIGHAM", 0.896 },
        { "NICHLESON", "NICHULSON", 0.956 },
        { "JONES", "JOHNSON", 0.832 },
        { "MASSEY", "MASSIE", 0.933 },
        { "ABROMS", "ABRAMS", 0.922 },
        { "HARDIN", "MARTINEZ", 0.000 },
        { "ITMAN", "SMITH", 0.000 },
        { "JERALDINE", "GERALDINE", 0.926 },
        { "MARHTA", "MARTHA", 0.961 },
        { "MICHELLE", "MICHAEL", 0.921 },
        { "JULIES", "JULIUS", 0.933 },
        { "TANYA", "TONYA", 0.880 },
        { "DWAYNE", "DUANE", 0.840 },
        { "SEAN", "SUSAN", 0.805 },
        { "JON", "JOHN", 0.933 },
        { "JON", "JAN", 0.000 },
    };

Results

When compiled and linked with your original code (and with the given parameters as shown in main) I found almost perfect agreement. The only difference was apparently that while values above the threshold are boosted (as in your code), values below are clamped to 0. I made the appropriate change to the code and the result is that all values match:

       true    SHACKLEFORD    SHACKELFORD   0.982   0.0002
       true     DUNNINGHAM      CUNNIGHAM   0.896   0.0003
       true      NICHLESON      NICHULSON   0.956   0.0004
       true          JONES        JOHNSON   0.832   0.0004
       true         MASSEY         MASSIE   0.933   0.0003
       true         ABROMS         ABRAMS   0.922   0.0002
       true         HARDIN       MARTINEZ   0.000   0.0000
       true          ITMAN          SMITH   0.000   0.0000
       true      JERALDINE      GERALDINE   0.926   0.0001
       true         MARHTA         MARTHA   0.961   0.0001
       true       MICHELLE        MICHAEL   0.921   0.0004
       true         JULIES         JULIUS   0.933   0.0003
       true          TANYA          TONYA   0.880   0.0000
       true         DWAYNE          DUANE   0.840   0.0000
       true           SEAN          SUSAN   0.805   0.0000
       true            JON           JOHN   0.933   0.0003
       true            JON            JAN   0.000   0.0000

    All tests passed
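The modified code is not shown in this answer. One plausible reading of "values below are clamped to 0" is sketched here; the helper name winkler_adjust and its signature are hypothetical, and the exact change that was made may differ. It would explain the JON / JAN row: the plain Jaro similarity there is 7/9, about 0.778, which falls just under the 0.78 threshold passed in main.

```cpp
#include <cstddef>

// Hypothetical helper: applies the Winkler boost above the threshold and, as an
// assumption about the change described above, reports 0 otherwise.
template <typename T = double>
T winkler_adjust(T similarity,
                 std::size_t common_prefix,
                 T boost_threshold = 0.7,
                 T scaling_factor = 0.1)
{
    if (similarity > boost_threshold) {
        return similarity + scaling_factor * common_prefix * (1 - similarity);
    }
    return 0;  // assumption: scores at or below the threshold are clamped
}
```

With the unmodified code from the question, the same pair would be returned unboosted as roughly 0.778 instead.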
