39

If I have a UTF-8 std::string, how do I convert it to a UTF-16 std::wstring? What I actually want to do is compare two Persian words.

2

7 Answers

55

This is how you do it with C++11:

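std::string str = "your string in utf8";
// The second template argument defaults to wchar_t, so it has to be given
// explicitly when the facet works on char16_t.
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
// With char16_t the result is a std::u16string, not a std::wstring.
std::u16string u16str = converter.from_bytes(str);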

And these are the headers you need:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

A more complete example is available here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
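For reference, here is a minimal self-contained sketch of the same approach wrapped in a function. The name utf8_to_u16 is hypothetical and not part of the standard library. It relies on std::wstring_convert and <codecvt>, which are deprecated since C++17, so expect deprecation warnings from newer compilers.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (name chosen for illustration only):
// converts UTF-8 bytes to UTF-16 code units using the C++11 facility.
std::u16string utf8_to_u16(const std::string& utf8)
{
    // The element type is spelled out because it defaults to wchar_t,
    // which would not match a char16_t facet.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    // from_bytes throws std::range_error if the conversion fails.
    return converter.from_bytes(utf8);
}
```

The result is a std::u16string. Code points above U+FFFF come out as a surrogate pair, so the number of code units can be larger than the number of characters.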


5 Comments

Great answer, thanks! ...but do follow the example at cppreference.com. wchar_t is not a 16-bit type on operating systems other than Windows. You need to use char16_t instead.
@CrisLuengo thanks! 👍 I updated the answer to use char16_t instead.
Not working with g++ 6.2 or clang++ 3.8 on lubuntu 16.04
Unfortunately, this was deprecated in C++17. mariusbancila.ro/blog/2018/07/05/…
And MSVC (on Windows) suggests this: The C++ Standard doesn't provide equivalent non-deprecated functionality; consider using MultiByteToWideChar() and WideCharToMultiByte() from <Windows.h> instead.
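For the Windows route mentioned in the last comment, a rough sketch could look like the following. It is Windows-only (hence the _WIN32 guard), the function name utf8_to_wide is hypothetical, and it uses the usual two-call pattern: first ask MultiByteToWideChar for the required length, then convert into a buffer of that size.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (Windows only): UTF-8 std::string -> UTF-16 std::wstring.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    const int length = static_cast<int>(utf8.size());

    // First call: no output buffer, returns the number of wchar_t needed.
    // MB_ERR_INVALID_CHARS makes the call fail on malformed UTF-8.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), length, nullptr, 0);
    if (needed <= 0)
        throw std::runtime_error("invalid UTF-8 input");

    // Second call: perform the conversion into the sized buffer.
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), length, &wide[0], needed);
    if (written != needed)
        throw std::runtime_error("UTF-8 conversion failed");

    return wide;
}
#endif
```

Passing the explicit byte length instead of -1 keeps a terminating null out of the result.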
31

Here's some code. It is only lightly tested and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 then it will throw an exception, otherwise it returns the equivalent UTF-16 wstring.

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Converts UTF-8 to UTF-16 stored in a std::wstring.
// Throws std::logic_error if the input is not valid UTF-8.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    // Step 1: decode the UTF-8 bytes into Unicode code points.
    std::vector<unsigned long> unicode;
    size_t i = 0;
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;  // number of continuation bytes still expected
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // A continuation byte cannot start a sequence.
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char cont = utf8[i++];
            if (cont < 0x80 || cont > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += cont & 0x3F;
        }
        // Surrogates and values beyond U+10FFFF are not valid code points.
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }

    // Step 2: encode the code points as UTF-16.
    std::wstring utf16;
    for (size_t k = 0; k < unicode.size(); ++k)
    {
        unsigned long uni = unicode[k];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            // Code points above U+FFFF become a surrogate pair.
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
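The function above stores its UTF-16 output in wchar_t, which is only a 16-bit type on Windows. As a sketch of one possible variation (not the answer's original code), the version below returns a std::u16string so the code units are 16 bits wide everywhere, and additionally rejects overlong encodings such as 0xC0 0x80. The name utf8_to_u16string and the use of std::invalid_argument are choices made for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical variant: decode UTF-8 and encode UTF-16 in a single pass.
std::u16string utf8_to_u16string(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        char32_t cp = 0;       // code point being assembled
        std::size_t todo = 0;  // continuation bytes still expected
        char32_t min_cp = 0;   // smallest code point valid for this length

        if (lead <= 0x7F)      { cp = lead; }
        else if (lead <= 0xBF) { throw std::invalid_argument("unexpected continuation byte"); }
        else if (lead <= 0xDF) { cp = lead & 0x1F; todo = 1; min_cp = 0x80; }
        else if (lead <= 0xEF) { cp = lead & 0x0F; todo = 2; min_cp = 0x800; }
        else if (lead <= 0xF7) { cp = lead & 0x07; todo = 3; min_cp = 0x10000; }
        else                   { throw std::invalid_argument("invalid lead byte"); }

        for (; todo > 0; --todo)
        {
            if (i == utf8.size())
                throw std::invalid_argument("truncated sequence");
            const unsigned char cont = static_cast<unsigned char>(utf8[i++]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("expected continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < min_cp)
            throw std::invalid_argument("overlong encoding");
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp <= 0xFFFF)
        {
            out += static_cast<char16_t>(cp);
        }
        else
        {
            // Split into a high and a low surrogate.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Two Persian words converted this way can then be compared with the ordinary == on std::u16string.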

7 Comments

Thank you! Thank you! It worked... I can't believe it :) Thank you for your time, john
Really glad it helped. It really is just a matter of asking the right question. There's a lot of knowledge on this forum, but newbies often can't access that knowledge because they don't know what to ask.
@aliakbarian: I've actually just spotted a minor bug in my code, you probably should copy it again. I changed this if (j == utf8.size()) to this if (i == utf8.size()).
Note: this is Windows-only. Unix systems use a 32-bit wchar_t. Although you can still do std::wstring wstr(str.begin(), str.end()); on Windows.
@coo Sure, that's possible. If your goal is to trash your data. Simply widening every UTF-8 code unit to fit into a UTF-16 code unit does not magically convert between those encodings. This will just produce gibberish for any code unit in the input sequence that doesn't happen to encode an ASCII code point.
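To make the point of the last comment concrete, here is a small sketch of what element-by-element widening actually produces. The helper widen_bytes is hypothetical. It mimics std::wstring wstr(str.begin(), str.end()), but uses char16_t and unsigned bytes so the outcome is the same on every platform.

```cpp
#include <string>

// Hypothetical helper: copies each UTF-8 byte into its own 16-bit code unit.
// This is NOT a UTF-8 to UTF-16 conversion; it only changes the element width.
std::u16string widen_bytes(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out += static_cast<char16_t>(b);
    return out;
}
```

For the Persian letter س (U+0633, encoded in UTF-8 as the bytes 0xD8 0xB3), widen_bytes yields the two code units 0x00D8 0x00B3, which display as "Ø³", whereas a real conversion yields the single code unit 0x0633. Only pure ASCII input survives unchanged.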
3

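To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>.
Note the string prefixes I use to define the UTF-16 literal (L) and the UTF-8 literal (u8). The L prefix only gives UTF-16 where wchar_t is 16 bits wide, as on Windows.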

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // Note: since C++20, u8"" literals are char8_t and no longer initialize a std::string.
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";

    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

    // Convert UTF-16 to UTF-8 (std::string)
    std::string utf8NativeString = convert.to_bytes(original16);
    // Convert UTF-8 to UTF-16 (std::wstring)
    std::wstring utf16NativeString = convert.from_bytes(original8);

    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);
    return 0;
}

1 Comment

As already pointed out in other comments, this was deprecated in C++17: en.cppreference.com/w/cpp/locale/wstring_convert
2

There are some relevant Q&As here and here which are worth a read.

Basically you need to convert the strings to a common format -- my preference is always to convert to UTF-8, but your mileage may vary.

Lots of software has been written for doing the conversion -- the conversion is straightforward and can be written in a few hours -- however, why not pick up something already done, such as UTF-8 CPP?
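To illustrate the "common format" idea: if both words are already valid UTF-8, they can be compared without converting anything. The sketch below uses two hypothetical helper names, and it assumes both strings use the same normalization form and the same code points for each letter. For Persian text that matters, because for example Arabic yeh (U+064A) and Farsi yeh (U+06CC) look alike but compare unequal.

```cpp
#include <string>

// Hypothetical helper: equality of two valid UTF-8 strings.
// Identical byte sequences mean identical code point sequences.
bool same_word_utf8(const std::string& a, const std::string& b)
{
    return a == b;
}

// Hypothetical helper: ordering by code point value.
// std::string compares bytes as unsigned values, and for valid UTF-8
// that byte order matches code point order. This is not a
// dictionary (collation) order for Persian.
bool precedes_by_code_point(const std::string& a, const std::string& b)
{
    return a < b;
}
```

For dictionary-style ordering, a collation-aware library would be needed, which is outside what this sketch covers.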

1 Comment

If you're Windows only: msdn.microsoft.com/en-us/library/dd319072(v=VS.85).aspx. Otherwise, use a portable library.
0

Microsoft has developed a beautiful library for such conversions as part of their Casablanca project, also known as the C++ REST SDK (cpprestsdk). The functions live in the namespace utility::conversions.

A simple usage, after bringing the namespace into scope, would look something like this:

using namespace utility::conversions;
auto utf16 = utf8_to_utf16("sample_string");


0

With C++/WinRT, you can easily convert a UTF-8 std::string to an hstring (which holds wchar_t) with winrt::to_hstring().


-1

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx

In the comment section of that page, there are also some interesting suggestions for this task like:

// Get an ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";

std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();

Or

// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring:
ws.assign(str.begin(), str.end());

Though I'm not sure about the validity of these approaches...

1 Comment

assign() is definitely not the way to convert UTF-8<->UTF-16. Don't try this at home.
