
I have the following piece of code:

#include <iostream>
#include <string>

std::string eps("ε");

int main() {
    std::cout << eps << '\n';
    return 0;
}

Somehow it compiles with g++ and clang on Ubuntu, and even prints out the right character, ε. I also have an almost identical piece of code that happily reads ε into a std::string with cin. By the way, eps.size() is 2.

My question is: how does that work? How can we put a Unicode character into a std::string? My guess is that the operating system handles all this Unicode work, but I'm not sure.

EDIT

As for output, I understand that it is the terminal that is responsible for showing me the right character (ε in this case).

But with input: cin reads characters up to ' ' or any other whitespace character (and, as I understand it, byte by byte). So if I take Ƞ, whose second byte is 32 (' '), it should read only the first byte and then stop. But it reads the whole Ƞ. How?

  • Maybe the editor you're using saves the file with UTF-8 encoding. Commented Dec 13, 2014 at 19:33
  • std::cout just sends a stream to the terminal. If your terminal handles UTF-8, this should work fine. Commented Dec 13, 2014 at 19:40
  • @SHR What does "his string is not UNICODE but UTF-8" mean? Please stop spreading nonsense. Guess what, I'm typing in Unicode right now. The explanations given by others above are correct. His editor saved the file in UTF-8 and his terminal knows how to handle UTF-8, so everything worked. This has nothing to do with wstring, which, by the way, doesn't know how to handle all of Unicode's complexities either. Commented Dec 13, 2014 at 20:09
  • Recommending to read utf8everywhere.org for clarification on encodings and the usage of std::string. Commented Dec 14, 2014 at 21:12
  • When characters are encoded as UTF-8, they are not simply stored as their Unicode code point. For example, Ƞ is not stored as the hexadecimal bytes 02 20. Instead, it is encoded in the special UTF-8 format, which for Ƞ is C8 A0. Commented May 18, 2015 at 13:46

1 Answer


The most likely reason is that everything is getting encoded in UTF-8, as it does on my system:

$ xxd test.cpp
...
0000020: 2065 7073 2822 ceb5 2229 3b0a 0a69 6e74   eps("..");..int
                        ^^^^ ε in UTF-8            ^^ TWO bytes!
...
$ g++ -o test.out test.cpp
$ ./test.out
ε
$ ./test.out | xxd
0000000: ceb5 0a
         ^^^^