87

I am trying to print a Russian "ф" (U+0444 CYRILLIC SMALL LETTER EF) character, which is given a code of decimal 1092. Using C++, how can I print out this character? I would have thought something along the lines of the following would work, yet...

int main() {
    wchar_t f = '1060';
    cout << f << endl;
}
  • Note that the problem is two-fold (at least when it comes to a valid C++ program): expressing the character in code, and correctly passing it to std::cout. (And even when those two steps are done correctly, it is a different matter altogether whether the character is displayed correctly by whatever std::cout is connected to.) Commented Aug 18, 2012 at 4:46
  • Does this answer your question? Unicode encoding for string literals in C++11 Commented Jun 24, 2021 at 2:33

12 Answers

80

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On OS X and most any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best option, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle so it takes only UTF-16 data.

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.
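As an illustration only, here is a minimal sketch of one portable-leaning approach, assuming the terminal or console font has Cyrillic glyphs: write the UTF-8 bytes yourself and, on Windows only, switch the console code page to UTF-8 first.

// A minimal sketch, not from the answer above: emit the UTF-8 bytes explicitly
// and, on Windows only, switch the console code page to UTF-8 first.
// Assumes the terminal/console font can display Cyrillic.
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

int main() {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);           // code page 65001
#endif
    std::cout << "\xD0\xA4 \xD1\x84\n";    // UTF-8 byte sequences for 'Ф' and 'ф'
}

Whether anything readable appears still depends on the terminal and its font, as the answer notes.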


5 Comments

I'm pretty sure '\u0444' won't fit into a char unless the compiler has promoted the char to an int, but if you want that behavior, you should use an int.
@EdwardFalk \u0444 will fit in an 8 bit char if the execution charset is, for example, ISO-8859-5. Specifically it will be the byte 0xE4. Note that I'm not suggesting that using such an execution charset is a good practice, I'm simply describing how C++ works.
Ahhh, you're saying the compiler will recognize \u0444 as a unicode character, and convert it to the prevailing character set, and the result will fit in a byte? I didn't know it would do that.
Yes. This is why using \u is different from using \x.
Doesn't work on my Lubuntu 16 laptop with the Terminator terminal and g++ 5.4.0; using a std::string worked, though.
20

When compiling with -std=c++11, one can simply

const char *s = u8"\u0444";
cout << s << endl;

3 Comments

Let me recommend Boost.Nowide for printing UTF-8 strings to a terminal in a portable way, so the above code will be almost unchanged; see the sketch after these comments.
@ybungalobill, your comment deserves an answer on its own. Would you mind creating one?
Just for my note: \uXXXX and \UXXXXXXXX are called universal-character-names. A string literal of the form u8"..." is a UTF-8 string literal. Both are specified in the standard.
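A minimal sketch of the Boost.Nowide suggestion from the first comment above, assuming Boost.Nowide is installed and the literal encoding is UTF-8; Boost documents boost::nowide::cout as a UTF-8-aware counterpart of std::cout that converts for the Windows console as needed:

#include <boost/nowide/iostream.hpp>

int main() {
    // UTF-8 narrow string; Boost.Nowide handles the Windows console conversion
    boost::nowide::cout << "\u0444\n";
}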
12

Ultimately, this is completely platform-dependent. Unicode support is, unfortunately, very poor in Standard C++. For GCC, you will have to make it a narrow string, as GCC uses UTF-8, while Windows wants a wide string, and there you must output to wcout.

// GCC
std::cout << "ф";

// Windoze
wcout << L"ф";

9 Comments

IIRC, Unicode escapes are \uXXXX where the XXXX is four hex digits. Unfortunately, this leaves out all the characters past U+FFFF.
@Mike: If you want past FFFF, you can do so by generating a UTF-16 surrogate pair yourself using two instances of \u, at least on windows.
@BillyONeal You do not use surrogate code points in C++ (in fact surrogate code points are completely prohibited). You use the format \UXXXXXXXX.
GCC is not bound to use UTF-8, and is available for Windows. std::wcout is also an option outside of Windows.
@Jam '\u0400' is a narrow-character literal. You seem to assume that \u0400 exists in the execution character set. According to N3242 [lex.ccon]/5: "A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation defined encoding."
10

This code works in Linux (C++11, Geany, and GCC 7.4 (g++, 2018-12-06)):

#include <iostream>
using namespace std;

int utf8_to_unicode(string utf8_code);
string unicode_to_utf8(int unicode);

int main()
{
    cout << unicode_to_utf8(36) << '\t';
    cout << unicode_to_utf8(162) << '\t';
    cout << unicode_to_utf8(8364) << '\t';
    cout << unicode_to_utf8(128578) << endl;

    cout << unicode_to_utf8(0x24) << '\t';
    cout << unicode_to_utf8(0xa2) << '\t';
    cout << unicode_to_utf8(0x20ac) << '\t';
    cout << unicode_to_utf8(0x1f642) << endl;

    cout << utf8_to_unicode("$") << '\t';
    cout << utf8_to_unicode("¢") << '\t';
    cout << utf8_to_unicode("€") << '\t';
    cout << utf8_to_unicode("🙂") << endl;

    cout << utf8_to_unicode("\x24") << '\t';
    cout << utf8_to_unicode("\xc2\xa2") << '\t';
    cout << utf8_to_unicode("\xe2\x82\xac") << '\t';
    cout << utf8_to_unicode("\xf0\x9f\x99\x82") << endl;

    return 0;
}

int utf8_to_unicode(string utf8_code)
{
    unsigned utf8_size = utf8_code.length();
    int unicode = 0;

    for (unsigned p = 0; p < utf8_size; ++p)
    {
        int bit_count = (p ? 6 : 8 - utf8_size - (utf8_size == 1 ? 0 : 1)),
            shift = (p < utf8_size - 1 ? (6 * (utf8_size - p - 1)) : 0);

        for (int k = 0; k < bit_count; ++k)
            unicode += ((utf8_code[p] & (1 << k)) << shift);
    }

    return unicode;
}

string unicode_to_utf8(int unicode)
{
    string s;

    if (unicode >= 0 and unicode <= 0x7f)  // 7F(16) = 127(10)
    {
        s = static_cast<char>(unicode);
        return s;
    }
    else if (unicode <= 0x7ff)  // 7FF(16) = 2047(10)
    {
        unsigned char c1 = 192, c2 = 128;

        for (int k = 0; k < 11; ++k)
        {
            if (k < 6)
                c2 |= (unicode % 64) & (1 << k);
            else
                c1 |= (unicode >> 6) & (1 << (k - 6));
        }

        s = c1;
        s += c2;
        return s;
    }
    else if (unicode <= 0xffff)  // FFFF(16) = 65535(10)
    {
        unsigned char c1 = 224, c2 = 128, c3 = 128;

        for (int k = 0; k < 16; ++k)
        {
            if (k < 6)
                c3 |= (unicode % 64) & (1 << k);
            else if (k < 12)
                c2 |= (unicode >> 6) & (1 << (k - 6));
            else
                c1 |= (unicode >> 12) & (1 << (k - 12));
        }

        s = c1;
        s += c2;
        s += c3;
        return s;
    }
    else if (unicode <= 0x1fffff)  // 1FFFFF(16) = 2097151(10)
    {
        unsigned char c1 = 240, c2 = 128, c3 = 128, c4 = 128;

        for (int k = 0; k < 21; ++k)
        {
            if (k < 6)
                c4 |= (unicode % 64) & (1 << k);
            else if (k < 12)
                c3 |= (unicode >> 6) & (1 << (k - 6));
            else if (k < 18)
                c2 |= (unicode >> 12) & (1 << (k - 12));
            else
                c1 |= (unicode >> 18) & (1 << (k - 18));
        }

        s = c1;
        s += c2;
        s += c3;
        s += c4;
        return s;
    }
    else if (unicode <= 0x3ffffff)  // 3FFFFFF(16) = 67108863(10)
    {
        ;  // Actually, there are no 5-byte Unicode code points
    }
    else if (unicode <= 0x7fffffff)  // 7FFFFFFF(16) = 2147483647(10)
    {
        ;  // Actually, there are no 6-byte Unicode code points
    }
    else
        ;  // Invalid code point (< 0 or > 2147483647)

    return "";
}



8

If you use Windows (note, we are using printf(), not cout):

// Save as UTF-8 without a signature
#include <stdio.h>
#include <windows.h>

int main() {
    SetConsoleOutputCP(65001);
    printf("ф\n");
}

This is not Unicode, but it also works, using Windows-1251 instead of UTF-8:

// Save as Windows-1251
#include <iostream>
#include <windows.h>
using namespace std;

int main() {
    SetConsoleOutputCP(1251);
    cout << "ф" << endl;
}

3 Comments

SetConsoleOutputCP() has a much better name in this case.
Just FYI: default cyrillic console encoding in Windows is OEM 866.
I had to use SetConsoleOutputCP(CP_UTF8); and printf(u8"Привет мир\n");
7

In C++23 you'll be able to do it with std::print, which supports Unicode:

#include <print>

int main() {
    std::print("{}\n", "ф");
}

Until std::print is widely available, you can use the open-source {fmt} library that it is based on (godbolt):

#include <fmt/core.h>

int main() {
    fmt::print("{}\n", "ф");
}

Just make sure that your literal encoding is UTF-8, which is normally the default on most platforms; on Windows/MSVC it is enabled with /utf-8.

I wouldn't recommend using wchar_t because it requires nonportable APIs such as _setmode.

Disclaimer: I'm the author of {fmt} and C++23 std::print.

2 Comments

Could you help with this possible std::print issue? It seems to work until ... it doesn't. stackoverflow.com/questions/79779176/…
This looks like a bug related to buffering. I recommend reporting it to the Microsoft STL maintainers.
2

'1060' is four characters, and won't compile under the standard. You should just treat the character as a number, if your wide characters match 1:1 with Unicode (check your locale settings).

int main() {
    wchar_t f = 1060;
    wcout << f << endl;
}

12 Comments

I thought that was one of the points of iostreams: it would detect the type via overloaded operator << and Do The Right Thing. Not so much, I guess?
@Jam much of this is system dependent. What OS are you using?
'1060' is a multi-char character literal of type int, and is entirely legal under standard C++. Its value is implementation-defined, though. Most implementations will take the values of the characters and concatenate them to produce a single integral value. These are sometimes used for so-called 'FourCC's (example after these comments).
Perhaps you'd be surprised how many warnings there are for entirely legal code. The C++ standard says "An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value." [lex.ccon] 2.14.3/1
@MikeDeSimone "every non-Mac compiler I've used emitted at least a warning" because it is 1) almost never used on purpose on non-Mac systems 2) not a portable construct
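For illustration, a hedged sketch of such a multicharacter ('FourCC'-style) literal; the value shown in the comment is only what many implementations produce, since the standard leaves it implementation-defined:

#include <iostream>

int main() {
    int tag = 'WAVE';   // multicharacter literal: type int, implementation-defined value (compilers may warn, e.g. -Wmultichar)
    std::cout << std::hex << tag << '\n';   // commonly prints 57415645, i.e. 'W' 'A' 'V' 'E' packed into one int
}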
1

I needed to show the string in the UI as well as save it to an XML configuration file. The format specified above is good for strings in C++; I would add that you can get an XML-compatible string for the special character by replacing "\u" with "&#x" and adding a ";" at the end.

For example:

C++: "\u0444" → XML : "&#x0444;"


0

In Linux, I can just do:

std::cout << "ф"; 

I just copy-pasted characters from here, and it didn't fail for at least the random sample I tried.


0

Another solution in Linux:

string a = "Ф";
cout << "Ф = \xd0\xa4 = "
     << hex << int(static_cast<unsigned char>(a[0]))
     << int(static_cast<unsigned char>(a[1]))
     << " (" << a.length() << "B)" << endl;

string b = "√";
cout << "√ = \xe2\x88\x9a = "
     << hex << int(static_cast<unsigned char>(b[0]))
     << int(static_cast<unsigned char>(b[1]))
     << int(static_cast<unsigned char>(b[2]))
     << " (" << b.length() << "B)" << endl;


0

Special thanks to the answer here for more-or-less the same question.

For me, all I needed was setlocale(LC_ALL, "en_US.UTF-8");

Then I could even use raw wchar_t characters.
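A minimal sketch of that approach, assuming a Linux system where the en_US.UTF-8 locale is installed and the terminal itself is UTF-8:

#include <clocale>
#include <iostream>

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");   // must name a locale installed on the system
    std::wcout << L"\u0444" << std::endl;    // raw wchar_t output of 'ф'
}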


-1

I wrote a whole Bluesky thread about this: https://bsky.app/profile/cecisharp.bsky.social/post/3ld2bpp5qj22h

Here's the fix and the short explanation:

locale::global(locale("en_US.UTF-8"));
wcout.imbue(locale("en_US.UTF-8"));

// Console output
wcout << L"Unicode character: \u03C6 (φ)" << endl;

// File output
wofstream outFile("output.txt");
outFile.imbue(locale("en_US.UTF-8"));
outFile << L"Unicode character: \u03C6 (φ)" << endl;
outFile.close();

return 0;

Stick that in your main. Remember to add this at the top:

#include <fstream> 

Run the code. A file will be created next to your source file. Check whether the file correctly reads: "Unicode character: φ (φ)".

If the file shows the correct symbol but the console doesn't, then the code is correct and you have a problem with your console. If the opposite is true, then the problem is with your code... For that, read the longer answer.

Open your command prompt and run this command: chcp 65001

This just sets your code page to 65001; however, I don't actually think this is the problem. I think you are probably seeing a character that isn't a question mark, but isn't the symbol you want either.

Some console fonts, like Consolas, use glyph substitution for characters they don’t fully support, so you might be seeing a fallback glyph for the Russian character.

Older Windows consoles (like Command Prompt) don’t fully support all Unicode characters, even with UTF-8 enabled.

To show special characters you need to:

  1. Have a font that supports the character.
  2. Save the character as a wide character if it's not part of ASCII's normal range.
  3. Make sure your locale understands that character.

Now here's the long answer:

Why do some characters print on the console, and others don’t? Your immediate thought might be, "Oh, the console’s font doesn’t support those characters." And yeah, that makes sense. Except... it’s not always true.

For example, the default font for most Windows consoles is Consolas. If you open the Character Map, you’ll see that Consolas supports a ton of characters. Including the square symbol, ■. So... why isn’t it showing up?

Your next guess might be, "Maybe it’s because I’m using an extended ASCII character, and I need to declare it as a wide character." Hmm. Nope, that didn’t work either.

Okay, forget ASCII for a second. What if we assign the character using its Unicode code? Hmm... still nothing.

Fine. What if we skip all that and just look up the ASCII value for the character, assign that number to a char, and print it that way? Oh, now it works! Why?

Well, the answer involves bytes, encoding, and how your program interprets text. Let’s break it down.

Why Assigning the Number Directly Works

When you assign a char like this:

char ch = 254;
cout << ch;

It works. Why? Because a char in C takes up exactly 1 byte—that’s 8 bits. And 254 fits perfectly into those 8 bits.

Here’s what happens:

You assign 254 to the char. Internally, the program stores it as the binary value 11111110. The console reads this byte, looks it up in its active code page (like CP437), and renders it as ■. This works because there’s no interpretation or decoding involved. You’re giving the program exactly what it needs, so it just prints the symbol without any fuss.

But what about this code?

char ch = '■';
cout << ch;

Why doesn’t that work? After all, it’s the same character, right? Well, here’s where encoding comes into play.

Remember that our code is nothing more than a text file that we give to a compiler to translate into binary. The encoding we use to save our source file determines how that translation is done.

Encoding is essentially the "translation system" that tells the computer how to interpret, store, and display text symbols. It’s important because most of what we see on a computer screen is text. You’ll even see it when saving something in notepad... And since our source file is nothing more than a text file at the end of the day, we also save it with a specific encoding.

Most people probably encode their source files as UTF-8 without even knowing it; it is the standard. So, what is UTF-8 encoding? Well, it's short for "Unicode Transformation Format - 8-bit", and it's a variable-length character encoding.

Basically it’s a kind of encoding that understands all Unicode symbols, and stores them in variables of different lengths.

Can you see where I’m going with this? In C, a character is always only one byte. But with UTF-8 encoding, characters can have varying lengths. In fact, with UTF-8 encoding, characters in the ASCII range (0–127) are encoded in 1 byte and have the same binary values as ASCII, while less common characters, like our square, use 2–4 bytes.
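To see those varying lengths concretely, here is a small sketch; the byte counts assume the source file really is saved as UTF-8:

#include <iostream>
#include <string>

int main() {
    std::string ascii  = "A";   // U+0041: 1 byte in UTF-8
    std::string square = "■";   // U+25A0: 3 bytes in UTF-8 (0xE2 0x96 0xA0)

    std::cout << ascii.size() << '\n';    // 1
    std::cout << square.size() << '\n';   // 3
}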

So when we write this code here:

char ch = '■';
cout << ch;

... and save the source file with UTF-8 encoding, then run the program, we end up trying to fit multiple bytes into one byte, which the program realizes isn’t gonna work, and defaults to a question mark.

Alright, so what if we use a wchar_t instead? Like this:

wchar_t ch = L'■';
wcout << ch;

That gives wchar_t enough space to store the character, so it should work, right? Nope. Not yet.

The issue here isn’t the storage space—it’s the locale.

By default, C++ uses the "C" locale. This is a minimal locale that only understands basic ASCII characters. It doesn’t know what ■ is, even if you’ve stored it correctly.

To fix this, you need to tell your program to use a locale that understands Unicode. For example:

locale::global(locale("en_US.UTF-8"));

wchar_t ch = L'■';
wcout << ch;

This one will work.

With this line, you’re switching to the English (US) locale with UTF-8 encoding, which can handle Unicode characters. Now the program knows how to interpret L'■' and display it properly.

So, let’s go back to everything we tried:

Assigning the Number Directly: Worked because we skipped all encoding and just gave the program the byte 254. The console knew how to render it.

Using a Literal: Failed because the source file was saved as UTF-8. The program couldn’t fit the 3-byte UTF-8 sequence for ■ into a single char.

Using a Wide Character: Failed until we set the locale. Even though wchar_t could store the character, the default "C" locale didn’t understand Unicode.

Setting the Locale: Worked because it allowed the program to interpret wide characters as Unicode.

1 Comment

Okidoki, I literally added a step-by-step to troubleshoot it at the top of everything, and also it's not a copy-paste.
