std::string::size() strange behaviour

Question

I believe the output has to do with UTF-8, but I do not know how. Could someone please explain?

    #include <iostream>
    #include <cstdint>
    #include <iomanip>
    #include <string>

    int main()
    {
        std::cout << "sizeof(char) = " << sizeof(char) << std::endl;
        std::cout << "sizeof(std::string::value_type) = " << sizeof(std::string::value_type) << std::endl;

        std::string _s1 ("abcde");
        std::cout << "s1 = " << _s1 << ", _s1.size() = " << _s1.size() << std::endl;

        std::string _s2 ("abcdé");
        std::cout << "s2 = " << _s2 << ", _s2.size() = " << _s2.size() << std::endl;

        return 0;
    }

The output is:

    sizeof(char) = 1
    sizeof(std::string::value_type) = 1
    s1 = abcde, _s1.size() = 5
    s2 = abcdé, _s2.size() = 6

g++ --version prints g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609

QTCreator compiles like this:

    g++ -c -m32 -pipe -g -std=c++0x -Wall -W -fPIC -I../strsize -I. -I../../Qt/5.5/gcc/mkspecs/linux-g++-32 -o main.o ../strsize/main.cpp
    g++ -m32 -Wl,-rpath,/home/rodrigo/Qt/5.5/gcc -o strsize main.o

Thanks a lot!

Thanks for your time. I added these two lines:

    std::cout << "sizeof('é') = " << sizeof('é') << std::endl;
    std::cout << "sizeof(\"é\") = " << sizeof("é") << std::endl;

And the output was:

    sizeof('é') = 4
    sizeof("é") = 3

– canellas, Commented Aug 20, 2016 at 13:18
@canellas é occupies two bytes in UTF-8, so 'é' is a multicharacter literal; those have type int, which would explain why sizeof('é') is 4. A string literal "é" is equivalent to a const char[], so sizeof("é") is 3 because the é is encoded with 2 chars in UTF-8 (0xC3 0xA9), followed by the null terminator.
– Remy Lebeau, Commented Aug 20, 2016 at 17:51
Thanks for your comments! I am still a bit lost... how can std::cout know that the bytes in positions 5 and 6 of abcdé must be combined into a two-byte value before printing?
– canellas, Commented Aug 21, 2016 at 11:51
"how can std::cout know that the bytes in positions 5 and 6 of abcdé must be combined into a two-byte value before printing?" It doesn't. It blindly outputs the 6 bytes of the string, irrespective of their content. Your console (i.e. the terminal, bash, et al.) is set to a UTF-8 environment and displays the appropriate glyph. See How to set up a clean UTF-8 environment in Linux.
– Remus Rusanu, Commented Aug 23, 2016 at 18:35

Peter Skarpetis · Accepted Answer · 2016-08-20 12:47:07Z

4

é (U+00E9) is encoded as 2 bytes, 0xC3 0xA9, in UTF-8.

Remus Rusanu · Accepted Answer · 2016-08-20 12:57:05Z

GCC's default input character set is UTF-8. Your editor probably also saved the file as UTF-8, so in your .cpp file the string abcdé occupies 6 bytes (as Peter already answered, LATIN SMALL LETTER E WITH ACUTE is encoded in UTF-8 as 2 bytes). std::string::size(), like std::string::length(), returns the length in bytes, i.e. 6. QED

You should open your source .cpp file in a hex editor to confirm.

Sergey · Accepted Answer · 2016-08-20 12:57:24Z

Even in C++11 std::string has nothing to do with UTF-8. In the description of size and length methods of std::string we can see:

For std::string, the elements are bytes (objects of type char), which are not the same as characters if a multibyte encoding such as UTF-8 is used.

Thus, you should use a third-party Unicode-aware library to handle Unicode strings.

If you continue to use non-Unicode string classes with Unicode strings, you may face lots of other problems. For example, you'll get a bogus result when comparing a precomposed character with the same-looking sequence of a base character followed by a combining character.

This explains it very well. It's an irrelevant coincidence that UTF-8 figures in this at all.
