How to convert UTF-8 std::string to UTF-16 std::wstring?

Question

If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.

Possible duplicate of "how can I compare utf8 string such as persian words in c++?" or this. – Kerrek SB, commented Aug 22, 2011 at 21:47

Yuchen · Accepted Answer · 2017-03-27 12:23:21Z


This is how you do it with C++11:

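    std::string str = "your string in utf8";
    // The second template argument must match the codecvt's wide type (char16_t),
    // so from_bytes() returns std::u16string rather than std::wstring.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    std::u16string ustr = converter.from_bytes(str);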

And these are the headers you need:

    #include <iostream>
    #include <string>
    #include <locale>
    #include <codecvt>

A more complete example available here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
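For readers who want the snippet above as a reusable function, here is a minimal sketch. The helper name utf8_to_u16 is made up for illustration and is not part of the answer or of the standard library. It assumes a C++11 to C++14 toolchain; from C++17 onwards the same code still compiles on common compilers but is reported as deprecated.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (name chosen for this sketch): decode a UTF-8 byte
// string into UTF-16 code units. wstring_convert throws std::range_error
// when the input is not valid UTF-8, because no error string is supplied.
std::u16string utf8_to_u16(const std::string& utf8)
{
    // Both template arguments use char16_t so that from_bytes()
    // returns std::u16string on every platform.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf8);
}
```

Calling utf8_to_u16 on two UTF-8 words and comparing the results with == compares them code unit by code unit. It does not apply any Unicode normalization, so visually identical words that use different code point sequences would still compare unequal.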

edited Mar 27, 2017 at 12:23

answered Jul 14, 2016 at 20:07

Yuchen


5 Comments

Cris Luengo Over a year ago

Great answer, thanks! ...but do follow the example at cppreference.com. wchar_t is not a 16-bit type on operating systems other than Windows. You need to use char16_t instead.

Yuchen Over a year ago

@CrisLuengo thanks! 👍 I updated the answer to use char16_t instead.

user2918461 Over a year ago

Not working with g++ 6.2 or clang++ 3.8 on lubuntu 16.04

Andrey Belykh Over a year ago

Unfortunately, this was deprecated in C++17. mariusbancila.ro/blog/2018/07/05/…

thomasa88 Over a year ago

And MSVC (on Windows) suggests this:

The C++ Standard doesn't provide equivalent non-deprecated functionality; consider using MultiByteToWideChar() and WideCharToMultiByte() from <Windows.h> instead.

john · Accepted Answer · 2011-08-22 22:36:36Z

Here's some code. It is only lightly tested and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 then it will throw an exception, otherwise it returns the equivalent UTF-16 wstring.

    #include <stdexcept>
    #include <string>
    #include <vector>

    std::wstring utf8_to_utf16(const std::string& utf8)
    {
        // Step 1: decode the UTF-8 bytes into Unicode code points.
        std::vector<unsigned long> unicode;
        size_t i = 0;
        while (i < utf8.size())
        {
            unsigned long uni;
            size_t todo;  // number of continuation bytes still expected
            unsigned char ch = utf8[i++];
            if (ch <= 0x7F) { uni = ch; todo = 0; }
            else if (ch <= 0xBF) { throw std::logic_error("not a UTF-8 string"); }  // stray continuation byte
            else if (ch <= 0xDF) { uni = ch & 0x1F; todo = 1; }
            else if (ch <= 0xEF) { uni = ch & 0x0F; todo = 2; }
            else if (ch <= 0xF7) { uni = ch & 0x07; todo = 3; }
            else { throw std::logic_error("not a UTF-8 string"); }
            for (size_t j = 0; j < todo; ++j)
            {
                if (i == utf8.size())
                    throw std::logic_error("not a UTF-8 string");  // truncated sequence
                unsigned char cont = utf8[i++];
                if (cont < 0x80 || cont > 0xBF)
                    throw std::logic_error("not a UTF-8 string");
                uni <<= 6;
                uni += cont & 0x3F;
            }
            if (uni >= 0xD800 && uni <= 0xDFFF)
                throw std::logic_error("not a UTF-8 string");  // surrogates are not valid code points
            if (uni > 0x10FFFF)
                throw std::logic_error("not a UTF-8 string");
            unicode.push_back(uni);
        }

        // Step 2: encode the code points as UTF-16 code units.
        std::wstring utf16;
        for (size_t k = 0; k < unicode.size(); ++k)
        {
            unsigned long uni = unicode[k];
            if (uni <= 0xFFFF)
            {
                utf16 += (wchar_t)uni;
            }
            else
            {
                // Code points above U+FFFF become a surrogate pair.
                uni -= 0x10000;
                utf16 += (wchar_t)((uni >> 10) + 0xD800);
                utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
            }
        }
        return utf16;
    }
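The last step of the function, splitting a code point above U+FFFF into a surrogate pair, is the part that is easiest to get wrong, so it may help to look at it in isolation. The following is only a sketch; the helper name to_surrogates is made up for illustration and uses the same arithmetic as the code above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in [0x10000, 0x10FFFF] into a
// UTF-16 (high surrogate, low surrogate) pair.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;  // leaves a 20-bit value
    // Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));
    std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));
    return {high, low};
}
```

For example, U+1F600 minus 0x10000 is 0xF600; its top 10 bits are 0x3D and its bottom 10 bits are 0x200, which gives the pair 0xD83D, 0xDE00.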

Thank you! Thank you! It worked... I can't believe it :) Thank you for your time, john.
Really glad it helped. It really is just a matter of asking the right question. There's a lot of knowledge on this forum, but newbies often can't access that knowledge because they don't know what to ask.
@aliakbarian: I've actually just spotted a minor bug in my code, you probably should copy it again. I changed this if (j == utf8.size()) to this if (i == utf8.size()).
Note: this is Windows only. Unix systems use 32 bits for wchar_t. Although you can still do std::wstring wstr(str.begin(), str.end()); on Windows.
@coo Sure, that's possible. If your goal is to trash your data. Simply widening every UTF-8 code unit to fit into a UTF-16 code unit does not magically convert between those encodings. This will just produce gibberish for any code unit in the input sequence that doesn't happen to encode an ASCII code point.
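To make the last comment concrete, here is a small sketch of what the widening approach actually produces. The helper name widen_bytes is made up for illustration; it mimics std::wstring wstr(str.begin(), str.end()) but uses char16_t and unsigned char so the result is the same on every platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper showing the approach that does NOT convert encodings:
// each UTF-8 byte is copied into its own 16-bit unit.
std::u16string widen_bytes(const std::string& utf8)
{
    std::u16string out;
    for (unsigned char byte : utf8)  // unsigned char avoids sign extension
        out.push_back(static_cast<char16_t>(byte));
    return out;
}
```

Pure ASCII input survives, because ASCII code points have the same value in both encodings. The Persian letter U+0633, whose UTF-8 encoding is the two bytes 0xD8 0xB3, comes out as the two units 0x00D8 and 0x00B3 instead of the single unit 0x0633.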

Yochai Timmer · Accepted Answer · 2020-01-02 09:47:41Z

To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>.
Note the string prefixes I use to define UTF-16 (L) and UTF-8 (u8).

    #include <cassert>
    #include <codecvt>
    #include <locale>
    #include <string>

    int main()
    {
        std::string original8 = u8"הלו";
        std::wstring original16 = L"הלו";

        // C++11 format converter
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

        // Convert to UTF-8 std::string and to UTF-16 std::wstring
        std::string utf8NativeString = convert.to_bytes(original16);
        std::wstring utf16NativeString = convert.from_bytes(original8);

        assert(utf8NativeString == original8);
        assert(utf16NativeString == original16);
        return 0;
    }

As already pointed out in other comments, this was deprecated in C++17: en.cppreference.com/w/cpp/locale/wstring_convert

Community · Accepted Answer · 2017-05-23 11:46:19Z

There are some relevant Q&As here and here which are worth a read.

Basically you need to convert the strings to a common format. My preference is always to convert to UTF-8, but your mileage may vary.

A lot of software has been written for doing the conversion. The conversion is straightforward and can be written in a few hours. However, why not pick up something that already exists, such as UTF-8 CPP?

If you're Windows only: msdn.microsoft.com/en-us/library/dd319072(v=VS.85).aspx. Otherwise, use a portable library.

Srijan Chaudhary · Accepted Answer · 2020-07-08 23:06:50Z

Microsoft has developed a beautiful library for such conversions as part of their Casablanca project, also known as CPPRESTSDK. The conversion functions live in the namespace utility::conversions.

A simple usage, after a using namespace utility::conversions declaration, would look something like this:

    utf8_to_utf16("sample_string");

isudfv · Accepted Answer · 2024-05-19 17:54:40Z

With WinRT, you can easily convert a UTF-8 std::string to an hstring (which holds wchar_t) with winrt::to_hstring().

jj1 · Accepted Answer · 2011-08-23 08:14:49Z

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx

In the comment section of that page, there are also some interesting suggestions for this task like:

    // Get an ASCII std::string from anywhere
    std::string sLogLevelA = "Hello ASCII-world!";
    std::wstringstream ws;
    ws << sLogLevelA.c_str();
    std::wstring sLogLevel = ws.str();

Or

    // To std::string:
    str.assign(ws.begin(), ws.end());
    // To std::wstring:
    ws.assign(str.begin(), str.end());

Though I'm not sure about the validity of these approaches...

assign() is definitely not the way to convert UTF-8<->UTF-16. Don't try this at home.
