Sorting UTF-8 strings?

Question

My std::strings are encoded in UTF-8, so the std::string operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it doesn't cut it is accents: é comes after z, which it should not.

Thanks

Why does the standard op< not "cut it"? What ordering do you want?
– Lara Bailey, Commented Jan 6, 2011 at 2:45
UTF-8-encoded strings sort in the same order as the equivalent UTF-32-encoded strings.
– dan04, Commented Jan 6, 2011 at 2:46
@Charles: I believe it doesn't "cut it" because that just performs a byte-by-byte comparison, and doesn't take into account accents, etc.
– user541686, Commented Jan 6, 2011 at 2:49
@Milo assuming you want lexicographic comparison by Unicode code point, I believe that UTF-8 is structured in such a way that lexicographic comparison of the UTF-8 bytes will give you the same result.
– Laurence Gonsalves, Commented Jan 6, 2011 at 2:56
@Lambert: What do you mean by "doesn't take into account accents"? Do you mean that "small letter e" followed by "combining acute accent" should be sorted the same as "small letter e with acute accent", or that "small letter e" should sort the same as "small letter e with acute accent"? If the former, then you are talking about Unicode normalization; if the latter, then you need locale-aware collation. I was asking the original question asker because it wasn't clear what he wanted to use the sort for. operator< is suitable for a lot of use cases.
– Lara Bailey, Commented Jan 6, 2011 at 9:51

Greg Hewgill · Accepted Answer · 2011-01-06 03:40:49Z

6

If you don't want a lexicographic ordering (which is what sorting the UTF-8 encoded strings lexicographically will give you), then you will need to decode your UTF-8 encoded strings into UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.

To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.

Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.
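The byte-order claim above can be checked with a small sketch. It assumes the input is valid UTF-8 and does no error handling; `decode_utf8`, `bytes_less` and `code_points_less` are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points.
// Assumption: `s` is valid UTF-8; malformed input is not detected.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits carried by the lead byte itself.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Plain byte-wise comparison, which is what std::string operator< does.
bool bytes_less(const std::string &a, const std::string &b) {
    return a < b;
}

// Comparison by decoded Unicode code points.
bool code_points_less(const std::string &a, const std::string &b) {
    return decode_utf8(a) < decode_utf8(b);
}
```

For valid UTF-8 the two comparisons should agree on every pair of strings, which is also why neither puts é before f: both order by code point, not by any language's alphabet.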

edited Jan 6, 2011 at 3:40

answered Jan 6, 2011 at 2:52


8 Comments

Eugene Yokota Over a year ago

Collation (sorting) and encoding are two completely separate issues unless you're treating them as byte arrays ANSI style. joelonsoftware.com/articles/Unicode.html

jmasterx Over a year ago

Yea but how do I compare them, is there a logical way to know that é comes before f and after e ?

dan04 Over a year ago

Depends on your locale. In German, ö sorts before p. In Swedish, the same letter sorts at the end of the alphabet.

jmasterx Over a year ago

@dan04 somehow, Windows succeeds at this for any locale

Lara Bailey Over a year ago

@Milo: In many languages 'é' does not come after 'e'; it sorts the same, so two words starting with these two letters sort based on what follows their initial letters. In some languages some accented letters sort differently from their unaccented counterparts, and some languages have digraphs that sort differently than the two characters that make them up would indicate. E.g. in Czech 'e' and 'ě' sort the same but 'č' sorts after 'c' and 'ch' sorts after 'h' (IIRC). See here: userguide.icu-project.org/collation and here: unicode.org/reports/tr10 for more details.

ephemient · Accepted Answer · 2011-01-12 05:51:21Z

The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    class collate_in {
    protected:
        const std::collate<char> &coll;
    public:
        collate_in(std::locale loc)
            : coll(std::use_facet<std::collate<char> >(loc)) {}
        bool operator()(const std::string &a, const std::string &b) const {
            // std::collate::compare() takes C-style string (begin, end)s and
            // returns values like strcmp or strcoll. Compare to 0 for results
            // expected for a less<>-style comparator.
            return coll.compare(a.c_str(), a.c_str() + a.size(),
                                b.c_str(), b.c_str() + b.size()) < 0;
        }
    };

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        // std::locale("") is the locale from the environment. One could also
        // std::locale::global(std::locale("")) to set up this program's global
        // first, and then use locale() to get the global locale, or choose a
        // specific locale instead of using the environment's.
        std::sort(v.begin(), v.end(), collate_in(std::locale("")));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }

    $ cat >file
    f
    é
    e
    d
    ^D
    $ LC_COLLATE=C ./a.out <file
    d
    e
    f
    é
    $ LC_COLLATE=en_US.utf8 ./a.out <file
    d
    e
    é
    f

It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        std::sort(v.begin(), v.end(), std::locale(""));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }
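When many strings are sorted, each one takes part in several comparisons, so it may be worth computing one sort key per string with std::collate<char>::transform() and sorting on the keys instead. The following is only a sketch of that idea; `sort_with_keys` is a hypothetical helper name, and whether it is actually faster depends on the locale implementation and the data.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns `items` sorted by the collation order of `loc`,
// calling std::collate<char>::transform() once per string.
std::vector<std::string> sort_with_keys(std::vector<std::string> items,
                                        const std::locale &loc) {
    const std::collate<char> &coll = std::use_facet<std::collate<char> >(loc);

    // Pair each string with its sort key. Keys compare with plain operator<
    // in the same order that coll.compare() would give the original strings.
    std::vector<std::pair<std::string, std::string> > keyed;
    keyed.reserve(items.size());
    for (std::size_t i = 0; i < items.size(); ++i) {
        const std::string &s = items[i];
        keyed.push_back(
            std::make_pair(coll.transform(s.data(), s.data() + s.size()), s));
    }

    // std::pair compares the key first; equal keys fall back to the raw string.
    std::sort(keyed.begin(), keyed.end());

    for (std::size_t i = 0; i < items.size(); ++i) {
        items[i] = keyed[i].second;
    }
    return items;
}
```

Called as `sort_with_keys(v, std::locale(""))`, this is meant to produce the same order as the `std::sort(v.begin(), v.end(), std::locale(""))` call above.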

Community · Accepted Answer · 2017-05-23 12:03:50Z

Encoding (UTF-8, UTF-16, etc.) isn't the problem; what matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.

I found "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.

Miguel Garcia · Accepted Answer · 2015-11-04 09:08:43Z

One option would be to use ICU collators (http://userguide.icu-project.org/collation/api) which provide a properly internationalized "compare" method that you can then use to sort.

Chromium has a small wrapper that should be easy to copy&paste/reuse

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

int main · Accepted Answer · 2025-01-18 17:12:58Z

A simplified version of the solution by @ephemient (and one that worked better for me):

    #include <algorithm>
    #include <array>
    #include <cstdlib>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
    #ifdef _WIN32
        // Make the Windows console use UTF-8.
        // The `/utf-8` flag should also be passed to the MSVC compiler,
        // and the source file should be saved in the UTF-8 code page.
        std::system("chcp 65001");
    #endif

        // Constructing the locale repeatedly (or giving it a short lifespan)
        // made the program hang or stall for me, so keep one long-lived
        // `std::locale` (possibly the global one).
        const std::locale loc("uk_UA.utf8"); // or "en_US.utf8", "<your_code_page>"

        // Array of `const char *`, using the deduction guide for array
        // creation (since C++17). It is important to store UTF-8 text as
        // strings, since a non-English letter takes more than one char.
        std::array ustrs{
            "f", "é", "e", "d", "в", "а", "д", "г", "б", "ї", "і",
        };

        std::sort(ustrs.begin(), ustrs.end(),
                  [&loc](const std::string &l, const std::string &r) {
                      return loc(l, r);
                  });

        for (const auto &us : ustrs) // us - UTF-8 string
            std::cout << us << ' ';
    }

    Active code page: 65001
    d e é f а б в г д і ї
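One thing to watch for with a hard-coded name such as "uk_UA.utf8": the std::locale constructor throws std::runtime_error when the named locale is not available on the machine. A minimal sketch of a fallback, assuming that plain byte order is an acceptable result in that case (`locale_or_classic` and `sort_in_locale` are hypothetical helper names):

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the named locale, or the classic "C" locale
// when the requested one is not installed.
std::locale locale_or_classic(const std::string &name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error &) {
        // Assumption: falling back to byte-wise ordering is acceptable here.
        return std::locale::classic();
    }
}

// Sorts in place, using std::locale::operator() as the comparator.
void sort_in_locale(std::vector<std::string> &v, const std::locale &loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

With the classic locale the result is the same byte-wise order that operator< gives, so accented letters end up after z again; the fallback only keeps the program from terminating on an uncaught exception.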
