Naming
Because the namespace is called edit_distance, it's easy to assume that jaro() means Jaro distance, but it actually computes the Jaro similarity. Similarly for jaro_winkler(). It may be worth adding a suffix to disambiguate them.
Include your own headers first
The test program includes <string> and <iostream> before "jaro_winkler.hpp". Generally, I prefer the opposite order, to avoid masking forgotten includes in my own headers.
Refactor the tests
There's a lot of repetition in the tests, which can be reduced:
static void show_distance(std::string_view a, std::string_view b) { std::cout << "Similarity for '" << a << "' and '" << b << "': " << edit_distance::jaro_winkler_similarity(a, b) << std::endl; } int main() { show_distance(std::string{"DWAYNE"}, "DUANE"); show_distance("MARTHA", "MARHTA"); show_distance("DIXON", "DICKSONX"); show_distance("JELLYFISH", "SMELLYFISH"); }
Default to double
In C and C++, double is the default floating-point type; float and long double have to be specifically requested. It's best to stay consistent with this, so I would write template <typename T = double>.
You can use SFINAE to prevent non-floating-point types from being seen:
template <typename T = double> inline std::enable_if_t<std::is_floating_point_v<T>, T> jaro_similarity(const std::string_view source, const std::string_view target)
What if both strings are empty?
I'd argue that two equal strings must always have a similarity of one, even if they are both empty. So I'd re-write that test:
if (source == target) return 1; if (source.empty() || target.empty()) return 0;
I think source.empty() indicates the intent slightly better than sl == 0, but that might be just me.
Simplify a test
const auto match_distance = (sl == 1 && tl == 1) ? 0 : (std::max(sl, tl) / 2 - 1);
Since we're using std::max(sl, tl) in one branch (and since std::max() is declared constexpr), it may be cheaper/clearer to use it for the test, too:
const auto match_distance = std::max(sl, tl) < 2 ? 0 : std::max(sl, tl) / 2 - 1;
Avoid raw pointers
Instead of creating new bool[], we normally prefer objects that C++ will clean up for us. The usual approach is to create a std::vector, but when the type is bool, that selects the (non-Standard-Container) specialization, which we don't want, for reasons of speed. We could use a std::vector<char> (or <unsigned char>), or we could stick with a bool[] but ask a smart pointer to manage it:
auto source_matches = std::make_unique<bool[]>(sl); auto target_matches = std::make_unique<bool[]>(tl);
Prefer std::size_t over unsigned int for indexes
The standard size type is guaranteed to have enough range for even the longest array.
Count using an integer type
Instead of accumulating 0.5 at a time, we can accumulate by 1 at a time, and divide at the end:
std::size_t t = 0; std::size_t k = 0; for (std::size_t i = 0; i < sl; ++i) { if (source_matches[i]) { while (!target_matches[k]) ++k; if (source[i] != target[k]) ++t; ++k; } } const T m = matches; return (m/sl + m/tl + 1 - t/m/2) / 3.0;
A tiny typo
I'm guessing boost_treshold should be boost_threshold.
Simplify the nested min() calls
If we change the type of prefix to match source.length() and target.length(), we can use the version of std::min() that accepts an initializer list:
const auto l = std::min({source.length(), target.length(), prefix});
My version
#include <algorithm> #include <cstddef> #include <memory> #include <string_view> #include <type_traits> namespace edit_distance { template <typename T = double> inline std::enable_if_t<std::is_floating_point_v<T>, T> jaro_similarity(const std::string_view source, const std::string_view target) { if (source == target) return 1; if (source.empty() || target.empty()) return 0; const auto sl = source.length(); const auto tl = target.length(); const auto match_distance = std::max(sl, tl) < 2 ? 0 : std::max(sl, tl) / 2 - 1; auto source_matches = std::make_unique<bool[]>(sl); auto target_matches = std::make_unique<bool[]>(tl); std::size_t matches = 0; for (std::size_t i = 0; i < sl; ++i) { const auto end = std::min(i + match_distance + 1, tl); const auto start = i > match_distance ? (i - match_distance) : 0u; for (auto k = start; k < end; ++k) { if (!target_matches[k] && source[i] == target[k]) { target_matches[k] = source_matches[i] = true; ++matches; break; } } } if (matches == 0) { return 0; } std::size_t t = 0; for (std::size_t i = 0, k = 0; i < sl; ++i) { if (source_matches[i]) { while (!target_matches[k]) ++k; if (source[i] != target[k]) ++t; ++k; } } const T m = matches; return (m/sl + m/tl + 1 - t/m/2) / 3.0; } template <typename T = double> inline T jaro_winkler_similarity(const std::string_view source, const std::string_view target, const std::size_t prefix = 2, const T boost_treshold = 0.7, const T scaling_factor = 0.1) { const auto similarity = jaro_similarity<T>(source, target); if (similarity > boost_treshold) { const auto l = std::min({source.length(), target.length(), prefix}); std::size_t common_prefix = 0; for (; common_prefix < l; ++common_prefix) { if (source[common_prefix] != target[common_prefix]) break; } return similarity + scaling_factor * common_prefix * (1 - similarity); } else { return similarity; } } } // namespace edit_distance // Test program #include <string> #include <iostream> static void show_distance(std::string_view a, std::string_view b) { std::cout << "Similarity for '" << a << "' and '" << b << "': " << edit_distance::jaro_winkler_similarity(a, b) << std::endl; } int main() { show_distance(std::string{"DWAYNE"}, "DUANE"); show_distance("MARTHA", "MARHTA"); show_distance("DIXON", "DICKSONX"); show_distance("JELLYFISH", "SMELLYFISH"); }