
Here's an Ideone: http://ideone.com/vjByty.

#include <iostream>
#include <string>

using namespace std;

int main() {
    string s = "\u0001\u0001";
    cout << s.length() << endl;
    if (s[0] == s[1]) {
        cout << "equal\n";
    }
    return 0;
}

I'm confused on so many levels.

What does it mean when I type in an escaped Unicode string literal in my C++ program?

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

Why are the first two characters of s (first two bytes) equal?
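
One way to see what is actually stored is to print each byte; here is a minimal sketch (assuming only that std::string holds the encoded bytes of the literal):

#include <iostream>
#include <string>

int main() {
    std::string s = "\u0001\u0001";
    std::cout << "length: " << s.length() << '\n';
    // Print each byte as an integer; cast through unsigned char to
    // avoid sign extension for byte values above 0x7F.
    for (unsigned char c : s) {
        std::cout << static_cast<int>(c) << ' ';
    }
    std::cout << '\n';
    return 0;
}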

Comments

  • It is probably compiler and operating system specific, and also depends on the version of the C++ standard. BTW, your assumption of UTF-16 is often false.
  • Could it be using UTF-8?
  • @BasileStarynkevitch not often false, always false. Unless you use the leading L on the string literal, then I suppose it's often. But that's not what we have here.
  • @MarkRansom Not necessarily always false. A platform could have a 16-bit char, with UTF-16 as the basic execution character set. (I don't know of any that do, but the standard definitely allows it.)
  • @JamesKanze yes, the standard is flexible enough to allow it. But without a concrete example, I'm sticking by my statement.

2 Answers


So the draft C++11 standard says the following about universal characters in narrow string literals:

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote [...] In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.

and includes the following note:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating ’\0’.

Section 2.14.3 referred to above says:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

If I try this example (see it live):

string s = "\u0F01\u0001"; 

The first universal character does map to multiple characters.
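
A minimal sketch of that check, assuming a UTF-8 execution character set (as in the linked example), where U+0F01 needs three bytes and U+0001 needs one:

#include <iostream>
#include <string>

int main() {
    std::string s = "\u0F01\u0001";
    // With a UTF-8 execution character set this prints 4:
    // U+0F01 is encoded as three char elements, U+0001 as one.
    std::cout << s.length() << '\n';
    return 0;
}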


What does it mean when I type in an escaped Unicode string literal in my C++ program?

To quote the standard:

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

Typically, the execution character set will be ASCII, which contains a character with value 1. So \u0001 will be translated into a single character with value 1.

If you were to specify non-ASCII characters, like \u263A, you might see more than one byte per character.
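
For example, a minimal sketch assuming a UTF-8 execution character set, where U+263A (WHITE SMILING FACE) is encoded as three bytes:

#include <iostream>
#include <string>

int main() {
    std::string s = "\u263A";
    // Under UTF-8 this prints 3, because U+263A is encoded
    // as the byte sequence E2 98 BA.
    std::cout << s.length() << '\n';
    return 0;
}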

Shouldn't it take 4 bytes for 2 characters? (assuming UTF-16)

If it were UTF-16, yes. But string can't be encoded with UTF-16, unless char has 16 bits, which it usually doesn't. UTF-8 is a more likely encoding, in which characters with values up to 127 (that is, the whole ASCII set) are encoded with a single byte.
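
If 16-bit code units are what you actually want, C++11 provides char16_t string literals; a minimal sketch:

#include <iostream>
#include <string>

int main() {
    // A u"..." literal is UTF-16: each \uXXXX below U+10000 is
    // exactly one 16-bit code unit.
    std::u16string s = u"\u0001\u0001";
    std::cout << s.length() << '\n';                    // 2 code units
    std::cout << s.length() * sizeof(char16_t) << '\n'; // 4 bytes on typical platforms
    return 0;
}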

Why are the first two characters of s (first two bytes) equal?

With the above assumptions, they are both the character with value 1.

Comments

  • I don't know that UTF-8 is more likely. But even with other encodings (like ISO 8859-1), '\u0001' will translate into a single byte.
  • @JamesKanze: Indeed, if the execution character set includes 1 (which it will if it's ASCII or a superset), then \u0001 must translate to a single byte, as the answer says. Perhaps "more likely" was a poor choice of words, since apparently there are some quaint systems that still use 8-bit encodings. But I've no idea what they might do with arbitrary Unicode points, and don't really want to know.
  • I'm sorry, I made a mistake in the Ideone. Here's one with the string \u0001\u0000: ideone.com/N6KkHt. Both characters are still treated as identical. They should both be treated as ASCII, which means they should be 1 and 0 respectively.
  • @batman: That's a GCC bug; \u0000 is incorrectly translated to 1. gcc.gnu.org/bugzilla/show_bug.cgi?id=53690
  • @MikeSeymour what a coincidence :P
