Redesign the interface
You already have some doubts about your code's structure, in particular that findLongestSimilarSequences() has too many responsibilities. You want the function to just do the finding part, and move the loop that prints the longest sequences out of it. So ideally, you want to be able to write something like:
for (int repetitions = 2; repetitions <= 5; ++repetitions) { std::cout << "Longest equal sequence(s) repeated " << repetitions << " times:\n"; for (auto sequence: findLongestSimilarSequences(text, index, repetitions)) { std::cout << sequence << "\n"; } } So findLongestSimilarSequences() has to return something that can be iterated over. That can be as simple as a std::vector<std::string_view>, which is quite efficient. Consider that you already allocate a vector of size index.size() in your function.
std::vector<std::string_view> findLongestSimilarSequences(const std::string& text, const std::vector<int>& index, int min_repetitions) { // Build adjacent_matches … // Create the result std::vector<std::string_view> result; for (int i = 0; i < adjacent_matches.size() - 1; ++i) { if (adjacent_matches[i] == max_match_length) { result.emplace_back(text.data() + index[i] - (max_match_length - 1), max_match_length); } } return result; } But when looking at your code, I also have other questions about the interface:
- Why is your
textnot passed as aconstreference? - Why not use
std::string_viewfortext? - Why limit it to strings at all? It seems to me that you could make this work for any kind of range.
- Why does the caller need to pass a vector named
index? What if you pass the wrong "index"? Either letfindLongestSimilarSequencescalculate this index, or create aclassthat encapsulates both the text and index, and pass a reference to that to this function. - Why does the name say "SimilarSequences"? That implies they might not be exactly identical. I think "RepeatedSubsequences" would be a better name.
Keep code simple and clear
I had to read findMatchLength() a few times to understand why it works, as normally I would expect a function like this to compare to substrings in the forward direction. I guess it's smart to pass indices to the end of the substrings, so you can use the minimum of those to avoid out-of-bounds reads, and in this particular use case you don't care about finding the match length of the strings reversed.
A much more intuitive interface would be to just pass two substrings, and compare them forwards. Then you can also use STL algorithms such as std::mismatch(), or even better, its ranges version:
std::size_t findMatchLength(std::string_view str1, std::string_view str2) { return std::ranges::mismatch(str1, str2).in1 - str1.begin(); } The standard library might get an atomic max function
There have been proposals to get atomic max and min functions into the standard library. While your update_max() looks fine, consider renaming it and changing the interface to the proposed atomic_fetch_max(), which will have an interface very similar to std::atomic_fetch_add(). This would make refactoring your code later to make use of a standard function simpler.
template<typename T> T atomic_fetch_max(std::atomic<T>* obj, typename std::atomic<T>::value_type arg) noexcept { auto prev = obj->load(std::memory_order_relaxed); while (prev < arg && !obj->compare_exchange_weak(prev, arg)) {} return prev; } Your code can return duplicates
Try adding a dash at the end of your test string in main(). Your code doesn't find subsequences with an exact number of repetitions, but rather with a minimum number of repetitions. And when the actual number of repetitions found is larger than the minimum number of repetitions, it will print the same subsequence multiple times.
Other issues
- Use
"\n"instead ofstd::endl; the latter is equivalent to the former, but also forces the output to be flushed, which is usually unnecessary and has a negative impact on performance. - You use
intin some places to hold sizes, like the loop iteratoriand as the value type of the vectorindex. If the text is larger than can be represented by anint, your code will no longer work correctly. Always prefer to usestd::size_tfor sizes, counts and indices. - You can avoid the atomic variable by changing the
for_each()into atransform_reduce(). However, it probably doesn't improve performance in a significant way.