5

I searched a lot, but couldn't find anything:

    unsigned int unicodeChar = 0x5e9;
    unsigned int utf8Char;
    uni2utf8(unicodeChar, utf8Char);
    assert(utf8Char == 0xd7a9);

Is there a library (preferably boost) that implements something similar to uni2utf8?

4
  • For the new C++11 Unicode string literals, see stackoverflow.com/questions/6796157/… Commented Jul 22, 2012 at 19:50
  • 2
    What you're asking for does not make sense and cannot work. There is no such thing as a UTF-8 character. There are UTF-8 code units, which are 8-bit values that when properly decoded form a Unicode codepoint. But UTF-8 code units are not stored in unsigned ints of 32-bits in size. Each code unit is 8 bits in size; therefore, the way to store a Unicode codepoint in UTF-8 is as a sequence of code units. A string, not an integer. Commented Jul 22, 2012 at 20:16
  • 1. UTF8 is unicode 2. use nowide. Commented Jul 23, 2012 at 20:56
  • utf8 is not Unicode, utf8 is a method for representing numbers. unicode on the other hand is a mapping between symbols to numbers. Abstract numbers, not their representation. Commented Jan 28, 2014 at 20:29
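To make the code-unit point from the comments concrete, here is a minimal hand-rolled sketch of the encoding step. `codepoint_to_utf8` and `pack_utf8_units` are hypothetical helper names, not part of Boost or the standard library. The second helper only exists to reproduce the `0xd7a9` value from the question; it assumes the sequence is at most four code units long.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode code point as a sequence of
// UTF-8 code units (bytes), returned in a std::string.
std::string codepoint_to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 1110xxxx + 2 continuation bytes
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogates are not valid code points");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::out_of_range("code point above U+10FFFF");
    }
    return out;
}

// Hypothetical helper: packs the code units big-endian into one integer.
// Only meaningful for sequences of at most four code units.
std::uint32_t pack_utf8_units(const std::string& utf8) {
    std::uint32_t packed = 0;
    for (unsigned char unit : utf8)
        packed = (packed << 8) | unit;
    return packed;
}
```

With these, `codepoint_to_utf8(0x5e9)` yields the two bytes `D7 A9`, and `pack_utf8_units` applied to that string gives the integer the question expected. The string form is the one to keep around; the packed integer is just a display convenience.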

4 Answers

15

Unicode conversions are part of C++11:

    #include <codecvt>
    #include <locale>
    #include <string>
    #include <cassert>

    int main() {
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
        std::string utf8 = convert.to_bytes(0x5e9);
        assert(utf8.length() == 2);
        assert(utf8[0] == '\xD7');
        assert(utf8[1] == '\xA9');
    }

6 Comments

is there a boost equivalent? (for those who can't code c++11)
@Ezra Yes, there is Boost.Locale, I've added another answer for that.
You don't need codecvt_utf8. codecvt<char32_t,char,std::mbstate_t> converts between UTF-32 and UTF-8, and codecvt<char16_t,char,std::mbstate_t> converts between UTF-16 and UTF-8.
@bames53: I strongly suspect that works only if char is natively UTF-8. E.g. Linux, but not Windows.
@bames53 Three reasons to prefer codecvt_utf8 (at least in conjunction with wstring_convert): 1. It contains the word utf8, so it's clearer to the reader what's happening. 2. It's shorter (fewer template arguments required). 3. codecvt has a protected destructor and is therefore not usable as a drop-in replacement for codecvt_utf8. If you're using wstring_convert, you need C++11 anyway, so you always have codecvt_utf8 at your disposal. I don't see much value in using codecvt here.
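For anyone who wants to try the plain codecvt route from this comment thread anyway, the protected destructor can be worked around with a small adapter. This is only a sketch: `deletable_facet`, `utf32_facet` and `to_utf8_via_codecvt` are hypothetical names introduced for illustration. The standard specifies the `char32_t` specialization as a UTF-32/UTF-8 conversion independent of the native `char` encoding. Note that `wstring_convert` is deprecated since C++17 and this specialization since C++20, so expect deprecation warnings on newer compilers.

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>
#include <utility>

// Hypothetical adapter: std::codecvt has a protected destructor, but
// std::wstring_convert must be able to delete the facet it owns.
// Deriving and declaring a public destructor makes that possible.
template <class Facet>
struct deletable_facet : Facet {
    template <class... Args>
    deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};

using utf32_facet =
    deletable_facet<std::codecvt<char32_t, char, std::mbstate_t>>;

// Hypothetical helper: converts one code point to UTF-8 through the facet.
std::string to_utf8_via_codecvt(char32_t cp) {
    std::wstring_convert<utf32_facet, char32_t> convert;
    return convert.to_bytes(cp);
}
```

Calling `to_utf8_via_codecvt(0x5e9)` should give the same two bytes as the codecvt_utf8 version in the answer, which also illustrates the point about verbosity: the adapter is extra code that codecvt_utf8 does not need.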
10

Boost.Locale also has functions for encoding conversions:

    #include <boost/locale.hpp>
    #include <cassert>
    #include <string>

    int main() {
        unsigned int point = 0x5e9;
        // Treat the single code point as a one-element UTF-32 range.
        std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
        assert(utf8.length() == 2);
        assert(utf8[0] == '\xD7');
        assert(utf8[1] == '\xA9');
    }


4

You might want to give the UTF8-CPP library a try. Encoding a Unicode character with it would look like this:

    std::wstring unicodeChar(L"\u05e9");
    std::string utf8Char;
    encode_utf8(unicodeChar, utf8Char);

std::string is used here just as a container for UTF-8 bytes.

6 Comments

Doesn't this assume that your unicodeChar is encoded in UTF-32? As far as I know, "wide strings" in C and C++ have an unspecified, opaque "system encoding" that could be anything. You'd first need to convert your wide string to UTF-32 using something like iconv.
@KerrekSB Do you see me using raw C wide strings alone or in conjunction with platform-specific implementation of std::wstring?
@KerrekSB Did I forget to "cook" that raw wide string with std::wstring, which knows full well how such strings should be handled on the current platform/compiler?
What do you think wstring is? It's just a container of wchar_ts, and you initialize those from a bog-standard wide string literal. Where's the "cooking"?
This code indeed won't work on Windows, where wchar_t is UCS-2/UTF-16 (16 bits), so a single wchar_t cannot hold code points from U+10000 upward and those cannot be converted to UTF-8 this way.
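The 16-bit limitation mentioned in the last comment comes from how UTF-16 represents code points at or above U+10000: as a pair of surrogate code units. A rough sketch of that mapping, using hypothetical helper names (`codepoint_to_utf16`, `combine_surrogates`) that do not come from any library discussed here:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one code point as one or two UTF-16 code units.
std::u16string codepoint_to_utf16(char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        throw std::invalid_argument("not a valid Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                     // fits in one unit
    } else {
        cp -= 0x10000;                                        // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}

// Hypothetical helper: rebuilds the code point from a surrogate pair.
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

U+05E9 from the question fits in a single unit, so the wstring approach happens to work for it even with a 16-bit wchar_t. A code point such as U+1F600 needs two units, and any converter that treats each wchar_t as a complete code point will mishandle it.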
-3

Use sprintf. (:

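    char cstring[64];
    snprintf(cstring, sizeof cstring, "%ls", unicodestring); // unicodestring: const wchar_t*; output encoding follows the current locale, not necessarily UTF-8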

