Printing unicode characters c++ linux

Question

I am using a raspberry pi and trying to print unicode characters with something like this:

test.cpp:

    #include<iostream>
    using namespace std;
    int main() {
        char a=L'\u1234';
        cout << a << endl;
        return 0;
    }

When I compile with g++, I get this warning:

    test.cpp: In function 'int main()':
    test.cpp:4:9: warning: large integer implicitly truncated to unsigned type [-Woverflow]

And the output is:

Also, this is not in the GUI and my distribution is raspbian wheezy if that is relevant.

Dmitrii S. · Accepted Answer · 2015-09-06 14:27:26Z

In response to one of the previous answers: you should not use wchar_t and the w* functions on Linux. POSIX APIs use the char data type, and most POSIX implementations use UTF-8 as the default encoding. Quoting the C++ standard (ISO/IEC 14882:2011):

5.3.3 Sizeof

sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1. The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. [ Note: in particular, sizeof(bool), sizeof(char16_t), sizeof(char32_t), and sizeof(wchar_t) are implementation-defined. — end note ]

UTF-8 uses 1-byte code units and up to 4 code units to represent a code point, so char is enough to store UTF-8 strings. To manipulate them, though, you need to find out whether a specific code point is represented by multiple code units (bytes) and build your processing logic with that in mind. wchar_t has an implementation-defined size; on the Linux distributions that I have seen, it is 4 bytes.

Another problem is that the mapping from the source code to the object code may transform your encoding in a compiler-specific way:

2.2 Phases of translation

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.

Anyway, in most cases no conversion is applied to your source code, so the strings that you put into a char* stay unmodified. If you encode your source code as UTF-8, then you are going to have bytes representing UTF-8 code units in your char* strings.

As for your code example: it does not work as expected because one char has a size of 1 byte. Unicode code points may require several (up to 4) UTF-8 code units to be serialized (for UTF-8, 1 code unit == 1 byte). U+1234 requires three bytes, E1 88 B4, when UTF-8 is used and therefore cannot be stored in a single char. If you modify your code as follows, it's going to work just fine:

    #include <iostream>

    int main() {
        // string literals are const, so point at them with const char*
        const char* str = "\u1234";
        std::cout << str << std::endl;
        return 0;
    }

This is going to output ሴ. You may see nothing, depending on your console and the installed fonts, but the actual bytes are going to be there. Note that with double quotes you also have a \0 terminator in memory.

You could also use an array, but not with single quotes, since a character literal would need a different data type:

    #include <iostream>

    int main() {
        const char* str = "\u1234";
        std::cout << str << std::endl;

        // size of the array is 4 because \0 is appended
        // for string literals and there are 3 bytes
        // needed to represent the code point
        char arr[4] = "\u1234";
        std::cout.write(arr, 3);
        std::cout << std::endl;
        return 0;
    }

The output is going to be ሴ on two separate lines in this case.

Devolus · Accepted Answer · 2013-08-04 07:28:53Z

You must set the locale before you can use it, unless your native system is already using it.

    setlocale(LC_CTYPE, "");

To print the string, use wcout instead of cout.

    #include <clocale>   // std::setlocale
    #include <iostream>

    int main() {
        // pick up the locale from the environment (LANG, LC_ALL, ...)
        std::setlocale(LC_CTYPE, "");
        wchar_t a = L'\u1234';
        std::wcout << a << std::endl;
        return 0;
    }

@BasileStarynkevitch, Uh, yes, I missed to change that as well. Fixed it.

dieram3 · Accepted Answer · 2013-08-04 06:48:00Z


You have to use wide characters:

try with:

    #include<iostream>
    using namespace std;
    int main() {
        wchar_t a = L'\u1234';
        wcout << a << endl;
    }


3 Comments

David G Over a year ago

Why must we use wide characters?

Dmitrii S. Over a year ago

@dieram3, no, you shouldn't. Firstly, wchar_t has nothing to do with Unicode: it is merely large enough to store a 4-byte code unit on most Linux distributions and is otherwise implementation-defined. POSIX APIs use encodings with single-byte code units, such as UTF-8, so you need to use the plain 'char' data type. wchar_t usage for working with Unicode comes from Windows.

Dmitrii S. Over a year ago

@0x499602D2 I would rather advise against using wide characters on Linux; please take a look at my answer instead: stackoverflow.com/questions/18040393/…
