My std::strings are encoded in UTF-8 so the std::string < operator doesn't cut it. How could I compare 2 utf-8 encoded std::strings?
Where it does not cut it is accents: é sorts after z, which it should not.
Thanks
If you don't want a lexicographic ordering (which is what sorting the UTF-8 encoded strings lexicographically will give you), then you will need to decode your UTF-8 encoded strings into UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.
To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
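A minimal sketch of that property, assuming well-formed UTF-8 input. `decode_utf8` and `orders_agree` are hypothetical helpers written for this illustration, not standard library functions, and the decoder does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into Unicode code points.
// No validation is performed; malformed input gives meaningless results.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }  // ASCII
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }  // 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = c & 0x07; len = 4; }  // 4-byte sequence
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out += cp;
        i += len;
    }
    return out;
}

// True when comparing the raw bytes (std::string's operator<) and comparing
// the decoded code points give the same "less than" answer for a and b.
bool orders_agree(const std::string &a, const std::string &b) {
    return (a < b) == (decode_utf8(a) < decode_utf8(b));
}
```

For example, `orders_agree("z", "\xC3\xA9")` holds: "z" sorts before "é" both as bytes (0x7A versus 0xC3) and as code points (U+007A versus U+00E9), which is exactly the ordering the question is unhappy with.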
Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.
The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    class collate_in {
    protected:
        const std::collate<char> &coll;
    public:
        collate_in(std::locale loc)
            : coll(std::use_facet<std::collate<char> >(loc)) {}
        bool operator()(const std::string &a, const std::string &b) const {
            // std::collate::compare() takes C-style string (begin, end)s and
            // returns values like strcmp or strcoll. Compare to 0 for results
            // expected for a less<>-style comparator.
            return coll.compare(a.c_str(), a.c_str() + a.size(),
                                b.c_str(), b.c_str() + b.size()) < 0;
        }
    };

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        // std::locale("") is the locale from the environment. One could also
        // std::locale::global(std::locale("")) to set up this program's global
        // first, and then use locale() to get the global locale, or choose a
        // specific locale instead of using the environment's.
        std::sort(v.begin(), v.end(), collate_in(std::locale("")));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }

    $ cat >file
    f
    é
    e
    d
    ^D
    $ LC_COLLATE=C ./a.out <file
    d
    e
    f
    é
    $ LC_COLLATE=en_US.utf8 ./a.out <file
    d
    e
    é
    f
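One practical caveat: constructing `std::locale` from a name throws `std::runtime_error` when the requested locale is not available on the system. The following is a sketch of one way to guard against that, not part of the answer above. The helper names `locale_or_classic` and `sort_with_locale` are made up for this illustration, and falling back to the classic "C" locale is only an assumption about what a program might want.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: try a named locale and fall back to the classic
// "C" locale if that name is not available on this system.
std::locale locale_or_classic(const char *name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error &) {
        return std::locale::classic();
    }
}

// Hypothetical helper: sort in place using the collate facet of `loc`.
// The facet reference stays valid because `loc` outlives the call to sort.
void sort_with_locale(std::vector<std::string> &v, const std::locale &loc) {
    const std::collate<char> &coll = std::use_facet<std::collate<char> >(loc);
    std::sort(v.begin(), v.end(),
              [&coll](const std::string &a, const std::string &b) {
                  return coll.compare(a.data(), a.data() + a.size(),
                                      b.data(), b.data() + b.size()) < 0;
              });
}
```

A call such as `sort_with_locale(v, locale_or_classic("en_US.utf8"))` then collates with the named locale where it is installed. Elsewhere it silently uses the "C" ordering, so accented letters may again sort after z.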
It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        std::sort(v.begin(), v.end(), std::locale(""));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }

Encoding (UTF-8, 16, etc) isn't the problem, it's whether the container itself is treating the string as Unicode string or 8-bit (ASCII or Latin-1) string that matters.
I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.
One option would be to use ICU collators (http://userguide.icu-project.org/collation/api) which provide a properly internationalized "compare" method that you can then use to sort.
Chromium has a small wrapper that should be easy to copy&paste/reuse
Simplified (and more working) version of solution by @ephemient
    #include <algorithm>
    #include <array>
    #include <cstdlib>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
    #ifdef _WIN32
        // Switch the Windows console to UTF-8.
        // The `/utf-8` flag should also be passed to the MSVC compiler,
        // and the source file should be saved in the UTF-8 code page.
        std::system("chcp 65001");
    #endif
        // Constructing the locale repeatedly (or giving it a short lifespan)
        // made the program hang or stall for me, so keep a single long-lived
        // (or global) std::locale.
        const std::locale loc("uk_UA.utf8"); // or "en_US.utf8", "<your_code_page>"

        // Array of `const char *`, created via the std::array deduction guide
        // (since C++17). Keep UTF-8 text as strings: a non-English letter
        // takes more than one char.
        std::array ustrs{"f", "é", "e", "d", "в", "а", "д", "г", "б", "ї", "і"};

        std::sort(ustrs.begin(), ustrs.end(),
                  [&loc](const std::string &l, const std::string &r) {
                      return loc(l, r);
                  });

        for (const auto &us : ustrs) // us: a UTF-8 string
            std::cout << us << ' ';
    }

    Active code page: 65001
    d e é f а б в г д і ї
op< does not "cut it"? What ordering do you want? operator< is suitable for a lot of use cases.