1

I believe the output has to do with UTF-8, but I do not understand how. Could someone please explain?

    #include <iostream>
    #include <cstdint>
    #include <iomanip>
    #include <string>

    int main() {
        std::cout << "sizeof(char) = " << sizeof(char) << std::endl;
        std::cout << "sizeof(std::string::value_type) = " << sizeof(std::string::value_type) << std::endl;

        std::string _s1 ("abcde");
        std::cout << "s1 = " << _s1 << ", _s1.size() = " << _s1.size() << std::endl;

        std::string _s2 ("abcdé");
        std::cout << "s2 = " << _s2 << ", _s2.size() = " << _s2.size() << std::endl;

        return 0;
    }

The output is:

    sizeof(char) = 1
    sizeof(std::string::value_type) = 1
    s1 = abcde, _s1.size() = 5
    s2 = abcdé, _s2.size() = 6

g++ --version prints g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609

Qt Creator compiles it like this:

    g++ -c -m32 -pipe -g -std=c++0x -Wall -W -fPIC -I../strsize -I. -I../../Qt/5.5/gcc/mkspecs/linux-g++-32 -o main.o ../strsize/main.cpp
    g++ -m32 -Wl,-rpath,/home/rodrigo/Qt/5.5/gcc -o strsize main.o

Thanks a lot!

  • Try printing sizeof('é') and see what you get. Commented Aug 20, 2016 at 12:55
  • Thanks for your time. I added these two lines:
        std::cout << "sizeof('é') = " << sizeof('é') << std::endl;
        std::cout << "sizeof(\"é\") = " << sizeof("é") << std::endl;
    And the output was:
        sizeof('é') = 4
        sizeof("é") = 3
    Commented Aug 20, 2016 at 13:18
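  • @canellas sizeof('é') is 4 most likely because, with the source saved as UTF-8, 'é' is a two-byte multicharacter literal, and multicharacter literals have type int. (An ordinary single-char literal such as 'e' has type char in C++, so no promotion to int is involved.) A string literal "é" is equivalent to a const char[], so sizeof("é") is 3: é is encoded as 2 chars in UTF-8 (0xC3 0xA9), followed by the null terminator. Commented Aug 20, 2016 at 17:51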
  • Thanks for your comments! I am still a bit lost... how can std::cout know that the bytes in positions 5 and 6 of abcdé must be combined into a two-byte value before printing? Commented Aug 21, 2016 at 11:51
  • "How can std::cout know that the bytes in positions 5 and 6 of abcdé must be combined into a two-byte value before printing?" It doesn't. It blindly outputs the 6 bytes of the string, regardless of their content. Your console (i.e. the terminal, bash, et al.) is set to a UTF-8 environment and displays the appropriate glyph. See "How to set up a clean UTF-8 environment in Linux". Commented Aug 23, 2016 at 18:35

3 Answers

4

é is encoded as 2 bytes, 0xC3 0xA9, in UTF-8.


4

gcc's default input character set is UTF-8. Your editor probably also saved the file as UTF-8, so in your .cpp source file the string abcdé occupies 6 bytes (as Peter already answered, LATIN SMALL LETTER E WITH ACUTE is encoded in UTF-8 as 2 bytes). std::string::length returns the number of char elements, i.e. the length in bytes, which is 6. QED

To confirm, open your .cpp source file in a hex editor.


3

Even in C++11, std::string has nothing to do with UTF-8. The description of the size and length methods of std::string says:

For std::string, the elements are bytes (objects of type char), which are not the same as characters if a multibyte encoding such as UTF-8 is used.

Thus, you should use a third-party Unicode-aware library to handle Unicode strings.

If you continue to use non-Unicode string classes with Unicode strings, you may face many other problems. For example, you will get a misleading result when comparing a precomposed character with its visually identical combining-character sequence: the two hold different bytes, so they compare unequal.

1 Comment

This explains it very well. It's an irrelevant coincidence that UTF-8 figures in this at all.
