Using Unicode in C++ source code

Question

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

std::wstring str = L"Strange chars: âÂ Čšđ ě €€";

RE: "whatever it's called": From Wikipedia: The first plane, plane 0, the Basic Multilingual Plane (BMP) contains characters for almost all modern languages, and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing. Most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.
– DavidRR, Commented Apr 22, 2015 at 14:41
I had an interesting variant. I had a UTF-8 character µ show as Âµ in my logs. I suspected GNU g++ assumed iso-8859-1 source and over-encoded the one-character two-byte sequence in the binary. Actually it understood source was UTF-8 based on locale. Log contained the correct two-byte sequence. Fact is, another part of the log contained stray bytes which introduced non-UTF-8 conformant byte sequences in the file. So, editor emacs figured out the file was most certainly actually ISO-8859-1 and showed two-byte characters as two separate characters. Fixing those stray bytes fixed the problem.
– Stéphane Gourichon, Commented Oct 3, 2019 at 16:56

Johannes Schaub - litb · Accepted Answer · 2014-04-01 19:31:50Z

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support characters from the basic source character set. These include common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. These look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
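
As a rough illustration of universal-character-names (a sketch assuming a C++11 or later compiler; the function names are made up for this example), the escapes below name code points without depending on how the source file itself is encoded:

```cpp
#include <cassert>
#include <string>

// \uXXXX names a code point with four hex digits (here U+20AC, the euro sign).
std::u32string euro() { return U"\u20AC"; }

// \UXXXXXXXX takes eight hex digits and can reach beyond the BMP
// (here U+1F600).
std::u32string grinning_face() { return U"\U0001F600"; }

// Hypothetical self-check: in a char32_t literal each code point occupies
// exactly one element, so the values can be compared directly.
void check_universal_character_names() {
    assert(euro().size() == 1 && euro()[0] == 0x20AC);
    assert(grinning_face().size() == 1 && grinning_face()[0] == 0x1F600);
}
```

Because the file contains only basic source characters, it should compile the same way whatever input encoding the compiler assumes.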

This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used. Here is what it says literally (C++98 version):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

For gcc, you can change it using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper options for this are -fexec-charset=charset for char (it defaults to utf-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either utf-16 or utf-32 depending on the size of wchar_t).
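
One way to see which execution character set is in effect is to dump the bytes stored for a narrow literal. The following is only a sketch: hex_bytes, euro_literal_bytes and check_hex_bytes are hypothetical names, and the bytes mentioned in the comments assume a UTF-8 execution character set.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a string as two lowercase hex
// digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// The bytes stored for this narrow literal depend on the execution character
// set. Assuming UTF-8, U+20AC is stored as the three bytes e2 82 ac; a
// different -fexec-charset setting would give different bytes.
std::string euro_literal_bytes() { return hex_bytes("\u20AC"); }

// Sanity check of the helper itself, assuming an ASCII-compatible charset.
void check_hex_bytes() { assert(hex_bytes("AB") == "41 42"); }
```

Printing euro_literal_bytes() from builds made with different -fexec-charset values is a quick way to compare what actually ends up in the binary.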

MSalters · Accepted Answer · 2008-12-03 15:03:03Z


In addition to litb's post, MSVC++ supports Unicode too. As I understand it, it detects the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅; If you're really into code obfuscation:

typedef void ‼; // Also known as \u203C
class ooɟ { operator ‼() {} };


This can be useful for writing, for example, mathematical software where the source code can be aligned to the source material. You can do this in Java, which accepts UTF-8 source code. However, for C++ (and C) there may be issues in how non-ASCII tokens are transformed into symbol names, which has to be compatible with the rest of the operating system - not just a feature of the compiler. For C++ this could be subsumed by name-mangling.
– simon.watts, Commented over a year ago

Head Geek · Accepted Answer · 2008-12-01 18:32:36Z

The C++ standard doesn't say anything about source-code file encoding, so far as I know.

The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at characters that used the high bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.

EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes produce errors on 8-bit ASCII too:

warning C4819: The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss.

It sort of does. I don't think it explicitly prevents or allows Unicode, but this is the minimum allowable character set: csci.csusb.edu/dick/c++std/cd2/lex.html#lex.charset
Since C++Builder 2007, the Borland/CodeGear compiler has supported Unicode source files, i.e. Unicode string literals and Unicode comments. The IDE has struggled a bit with them, but the compiler's happy!
The Borland thing I mentioned was from roughly twenty years ago (the last time I tried putting a high-ASCII character in a source-code file). :-) I haven't used a Borland compiler in about ten years.
Microsoft compilers support Unicode only for wide characters (L"...").

Max Lybbert · Accepted Answer · 2015-04-22 15:12:08Z

There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.

As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters -- like square brackets -- available), so the concept of digraphs and trigraphs was introduced. Many compilers accept more than this character set at this time, but there isn't any guarantee.
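
To make the digraph idea concrete, here is a small sketch (the function names are invented for this example). It uses only digraphs, since they are still part of standard C++, whereas trigraphs were removed in C++17:

```cpp
#include <cassert>

// Written with digraphs: <% %> stand for { }, and <: :> stand for [ ].
int sum_first_two() <%
    int values<:2:> = <%1, 2%>;
    return values<:0:> + values<:1:>;
%>

// The same function in the ordinary spelling.
int sum_first_two_plain() {
    int values[2] = {1, 2};
    return values[0] + values[1];
}

// Hypothetical self-check: both spellings behave identically.
void check_digraphs() {
    assert(sum_first_two() == sum_first_two_plain());
}
```

The compiler treats each digraph as the same token as the character it replaces, so the two functions are equivalent.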

As for strings and string literals, C++ has the concept of a wide character and wide character string. However, the encoding for that character set is implementation-defined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide character string literals look like L"string literal", and these can be assigned to std::wstring objects.

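C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8 (u8"..."), UTF-16 (u"...") and UTF-32 (U"..."). The UTF-16 and UTF-32 forms are stored as char16_t and char32_t code units in the platform's native byte order, so big-endian versus little-endian only matters when the data is written out as bytes.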
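
Byte order comes into play once UTF-16 code units are serialized. A minimal sketch, assuming C++11 and using a hypothetical helper named to_utf16be, that writes the code units out most significant byte first:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: write each UTF-16 code unit as two bytes, most
// significant byte first (UTF-16BE), regardless of the host's byte order.
std::vector<unsigned char> to_utf16be(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit >> 8));
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));
    }
    return bytes;
}

// Hypothetical self-check: U+20AC is the single code unit 0x20AC.
void check_to_utf16be() {
    const std::vector<unsigned char> expected = {0x20, 0xAC};
    assert(to_utf16be(u"\u20AC") == expected);
}
```

Swapping the two push_back calls would produce UTF-16LE instead; the literal in the source stays the same either way.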

Rob · Accepted Answer · 2008-12-01 18:26:42Z

For encoding in strings I think you are meant to use the \u notation, e.g.:

std::wstring str = L"\u20AC"; // Euro character

coppro · Accepted Answer · 2008-12-01 19:51:50Z

It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16, but sometimes 32 bits. This is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee as to the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the most suitable integer type for your Unicode entity.
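
The typedef approach could look something like the sketch below. The names code_point, encode_utf8 and check_encode_utf8 are invented for this example, and the encoder is a plain hand-written UTF-8 routine rather than anything the standard library provides:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical typedef: an integer type guaranteed wide enough for any
// Unicode code point (U+0000..U+10FFFF), unlike wchar_t.
typedef std::uint_least32_t code_point;

// Encode one code point as UTF-8. Returns an empty string for values that
// are not Unicode scalar values (surrogates or anything above U+10FFFF).
std::string encode_utf8(code_point cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical self-check: ASCII code points encode to themselves.
void check_encode_utf8() { assert(encode_utf8(0x41) == "A"); }
```

Working on code_point values keeps the text-processing logic independent of whatever width and encoding wchar_t happens to have on a given platform.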

C++11 has additional Unicode support in the form of UTF-8 encoded string literals (u8"text"), and UTF-16 and UTF-32 data types (char16_t and char32_t IIRC) as well as corresponding string constants (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx constants is still implementation-defined, though (and there is no encoding support for complex string types outside the literals).
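
A short sketch of how the three literal forms differ in code-unit count (assuming C++11 or later; the names below are invented for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+20AC in UTF-8: sizeof counts the code units plus the terminating null.
// (The element type of a u8 literal is char before C++20 and char8_t from
// C++20 on; both are one byte wide, so this expression works either way.)
constexpr std::size_t euro_utf8_units = sizeof(u8"\u20AC") - 1;

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
std::size_t face_utf16_units() { return std::u16string(u"\U0001F600").size(); }

// UTF-32 always uses one code unit per code point.
std::size_t face_utf32_units() { return std::u32string(U"\U0001F600").size(); }

// Hypothetical self-check of the counts described above.
void check_literal_sizes() {
    static_assert(euro_utf8_units == 3, "U+20AC takes three UTF-8 code units");
    assert(face_utf16_units() == 2);
    assert(face_utf32_units() == 1);
}
```

Since every character here is written as a universal-character-name, the result does not depend on the source file's encoding.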

jogojapan · Accepted Answer · 2012-08-23 04:42:30Z

In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".

GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.
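
To check whether a file was actually saved with a BOM, a small helper along these lines may be enough. It is a sketch only: has_utf8_bom, strip_utf8_bom and check_utf8_bom are hypothetical names, and the caller is assumed to have read the file's raw bytes into a std::string already.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if the buffer begins with the UTF-8 byte order
// mark, i.e. the three bytes EF BB BF.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: return the buffer without a leading BOM, if any.
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}

// Hypothetical self-check: plain ASCII input has no BOM and is unchanged.
void check_utf8_bom() {
    assert(!has_utf8_bom("int x;"));
    assert(strip_utf8_bom("int x;") == "int x;");
}
```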

Klaim · Accepted Answer · 2008-12-01 18:27:16Z

AFAIK it's not standardized, since you can put any type of character in wide strings. You just have to check that your compiler is set to treat the source code as Unicode to make it work right.
