Return to Revisions

4 of 4

added 32 characters in body

edited Jan 20, 2024 at 14:52

69.5k
3
75
180

Redesign the interface

You already have some doubts about your code's structure, in particular that findLongestSimilarSequences() has too many responsibilities. You want the function to just do the finding part, and move the loop that prints the longest sequences out of it. So ideally, you want to be able to write something like:

for (int repetitions = 2; repetitions <= 5; ++repetitions) { std::cout << "Longest equal sequence(s) repeated " << repetitions << " times:\n"; for (auto sequence: findLongestSimilarSequences(text, index, repetitions)) { std::cout << sequence << "\n"; } }

So findLongestSimilarSequences() has to return something that can be iterated over. That can be as simple as a std::vector<std::string_view>, which is quite efficient. Consider that you already allocate a vector of size index.size() in your function.

std::vector<std::string_view> findLongestSimilarSequences(const std::string& text, const std::vector<int>& index, int min_repetitions) { // Build adjacent_matches … // Create the result std::vector<std::string_view> result; for (int i = 0; i < adjacent_matches.size() - 1; ++i) { if (adjacent_matches[i] == max_match_length) { result.emplace_back(text.data() + index[i] - (max_match_length - 1), max_match_length); } } return result; }

But when looking at your code, I also have other questions about the interface:

Why is your text not passed as a const reference?
Why not use std::string_view for text?
Why limit it to strings at all? It seems to me that you could make this work for any kind of range.
Why does the caller need to pass a vector named index? What if you pass the wrong "index"? Either let findLongestSimilarSequences calculate this index, or create a class that encapsulates both the text and index, and pass a reference to that to this function.
Why does the name say "SimilarSequences"? That implies they might not be exactly identical. I think "RepeatedSubsequences" would be a better name.

Keep code simple and clear

I had to read findMatchLength() a few times to understand why it works, as normally I would expect a function like this to compare to substrings in the forward direction. I guess it's smart to pass indices to the end of the substrings, so you can use the minimum of those to avoid out-of-bounds reads, and in this particular use case you don't care about finding the match length of the strings reversed.

A much more intuitive interface would be to just pass two substrings, and compare them forwards. Then you can also use STL algorithms such as std::mismatch(), or even better, its ranges version:

std::size_t findMatchLength(std::string_view str1, std::string_view str2) { return std::ranges::mismatch(str1, str2).in1 - str1.begin(); }

The standard library might get an atomic max function

There have been proposals to get atomic max and min functions into the standard library. While your update_max() looks fine, consider renaming it and changing the interface to the proposed atomic_fetch_max(), which will have an interface very similar to std::atomic_fetch_add(). This would make refactoring your code later to make use of a standard function simpler.

template<typename T> T atomic_fetch_max(std::atomic<T>* obj, typename std::atomic<T>::value_type arg) noexcept { auto prev = obj->load(std::memory_order_relaxed); while (prev < arg && !obj->compare_exchange_weak(prev, arg)) {} return prev; }

Your code can return duplicates

Try adding a dash at the end of your test string in main(). Your code doesn't find subsequences with an exact number of repetitions, but rather with a minimum number of repetitions. And when the actual number of repetitions found is larger than the minimum number of repetitions, it will print the same subsequence multiple times.

Other issues

Use "\n" instead of std::endl; the latter is equivalent to the former, but also forces the output to be flushed, which is usually unnecessary and has a negative impact on performance.
You use int in some places to hold sizes, like the loop iterator i and as the value type of the vector index. If the text is larger than can be represented by an int, your code will no longer work correctly. Always prefer to use std::size_t for sizes, counts and indices.
You can avoid the atomic variable by changing the for_each() into a transform_reduce(). However, it probably doesn't improve performance in a significant way.

answered Jan 17, 2024 at 16:18

G. Sliepen

69.5k
3
75
180