
I have the following piece of code:

#include <iostream>
#include <string>

std::string eps("ε");

int main() {
    std::cout << eps << '\n';
    return 0;
}

Somehow it compiles with g++ and clang on Ubuntu, and even prints out the right character, ε. I also have an almost identical piece of code that happily reads ε into a std::string with cin. By the way, eps.size() is 2.

My question is: how does that work? How can we put a Unicode character into a std::string? My guess is that the operating system handles all this Unicode work, but I'm not sure.

EDIT

As for output, I understand that it is the terminal that is responsible for showing me the right character (ε in this case).

But with input: cin reads characters up to ' ' or any other whitespace character (and, as I understand it, byte by byte). So if I take Ƞ, whose second byte is 32 (' '), it should read only the first byte and then stop. But it reads the whole Ƞ. How?

  • Maybe the editor you're using saves the file with UTF-8 encoding. Commented Dec 13, 2014 at 19:33
  • std::cout just sends a stream to the terminal. If your terminal handles UTF-8, this should work fine. Commented Dec 13, 2014 at 19:40
  • @SHR What does "his string is not UNICODE but UTF-8" mean? Please stop spreading nonsense. Guess what, I'm typing in Unicode right now. The explanations given by others above are correct. His editor saved the file in UTF-8 and his terminal knows how to handle UTF-8, so everything worked. This has nothing to do with wstring, which, by the way, doesn't know how to handle all of Unicode's complexities either. Commented Dec 13, 2014 at 20:09
  • Recommending to read utf8everywhere.org for clarification on encodings and the usage of std::string. Commented Dec 14, 2014 at 21:12
  • When characters are encoded as UTF-8, they are not simply stored as their Unicode code point. For example, Ƞ is not stored as the hexadecimal bytes 02 20. Instead, it is encoded in the special UTF-8 format, which for Ƞ is C8 A0. Commented May 18, 2015 at 13:46

1 Answer


The most likely reason is that everything is getting encoded in UTF-8, as it does on my system:

$ xxd test.cpp
...
0000020: 2065 7073 2822 ceb5 2229 3b0a 0a69 6e74   eps("..");..int
                        ^^^^ ε in UTF-8            ^^ TWO bytes!
...
$ g++ -o test.out test.cpp
$ ./test.out
ε
$ ./test.out | xxd
0000000: ceb5 0a
         ^^^^