
So we get a string like "Новая папка" whose bytes are actually the UTF-16 encoding of the text (the UTF-16 representation of "Новая папка"). We want to turn this string into a wstring without changing the encoding, meaning literally carry all the data from the string over to the wstring without any conversion, so we would get a wstring with the contents "Новая папка". How do I do such a thing?

Update: What I meant to say is: we have all the data for a correct UTF-16 string inside the string. All we need is to put that data into a wstring. That means that if the wstring consists of wchar_t units, one of which might happen to be 0x0000, we would have to put the two string chars 0x00 and 0x00 together to get it. That is what I do not know how to do.
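
To make the pairing concrete, here is a rough sketch of what I imagine (pack_bytes is just a made-up name, and I am assuming the low byte of each pair comes first in the string):

#include <string>

// Sketch: every two chars of the std::string become one wchar_t,
// with no re-encoding. Assumes the low byte is stored first.
std::wstring pack_bytes(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size() / 2);
    for (std::string::size_type i = 0; i + 1 < in.size(); i += 2)
    {
        unsigned char lo = (unsigned char)in[i];
        unsigned char hi = (unsigned char)in[i + 1];
        out.push_back((wchar_t)((hi << 8) | lo)); // 00,00 -> 0x0000
    }
    return out;
}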

Update 2: How I got here: a C++ library I am obligated to use on my server is a C-style parser, and it returns the user request address to me as a std::string, while I make my clients send requests to me in this format:

url_encode(UTF16toUTF8(wstring)) //pseudocode. 

where

string UTF16toUTF8(const wstring & in)
{
    string out;
    unsigned int codepoint;
    bool completecode = false;
    for (wstring::const_iterator p = in.begin(); p != in.end(); ++p)
    {
        if (*p >= 0xd800 && *p <= 0xdbff)
        {
            codepoint = ((*p - 0xd800) << 10) + 0x10000;
            completecode = false;
        }
        else if (!completecode && *p >= 0xdc00 && *p <= 0xdfff)
        {
            codepoint |= *p - 0xdc00;
            completecode = true;
        }
        else
        {
            codepoint = *p;
            completecode = true;
        }
        if (completecode)
        {
            if (codepoint <= 0x7f)
                out.push_back(codepoint);
            else if (codepoint <= 0x7ff)
            {
                out.push_back(0xc0 | ((codepoint >> 6) & 0x1f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
            else if (codepoint <= 0xffff)
            {
                out.push_back(0xe0 | ((codepoint >> 12) & 0x0f));
                out.push_back(0x80 | ((codepoint >> 6) & 0x3f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
            else
            {
                out.push_back(0xf0 | ((codepoint >> 18) & 0x07));
                out.push_back(0x80 | ((codepoint >> 12) & 0x3f));
                out.push_back(0x80 | ((codepoint >> 6) & 0x3f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
        }
    }
    return out;
}

std::string url_encode( std::string sSrc )
{
    const char SAFE[256] =
    {
        /*      0 1 2 3  4 5 6 7  8 9 A B  C D E F */
        /* 0 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 1 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 2 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 3 */ 1,1,1,1, 1,1,1,1, 1,1,0,0, 0,0,0,0,
        /* 4 */ 0,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1,
        /* 5 */ 1,1,1,1, 1,1,1,1, 1,1,1,0, 0,0,0,0,
        /* 6 */ 0,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1,
        /* 7 */ 1,1,1,1, 1,1,1,1, 1,1,1,0, 0,0,0,0,
        /* 8 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 9 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* A */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* B */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* C */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* D */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* E */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* F */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0
    };
    const char DEC2HEX[16 + 1] = "0123456789ABCDEF";

    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    unsigned char * const pStart = new unsigned char[SRC_LEN * 3];
    unsigned char * pEnd = pStart;
    const unsigned char * const SRC_END = pSrc + SRC_LEN;

    for (; pSrc < SRC_END; ++pSrc)
    {
        if (SAFE[*pSrc])
            *pEnd++ = *pSrc;
        else
        {
            // escape this char
            *pEnd++ = '%';
            *pEnd++ = DEC2HEX[*pSrc >> 4];
            *pEnd++ = DEC2HEX[*pSrc & 0x0F];
        }
    }

    std::string sResult((char *)pStart, (char *)pEnd);
    delete [] pStart;
    return sResult;
}

std::string url_decode( std::string sSrc )
{
    // Note from RFC1630: "Sequences which start with a percent sign
    // but are not followed by two hexadecimal characters (0-9, A-F)
    // are reserved for future extension"
    const char HEX2DEC[256] =
    {
        /*       0  1  2  3   4  5  6  7   8  9  A  B   C  D  E  F */
        /* 0 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 1 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 2 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 3 */  0, 1, 2, 3,  4, 5, 6, 7,  8, 9,-1,-1, -1,-1,-1,-1,
        /* 4 */ -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 5 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 6 */ -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 7 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 8 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 9 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* A */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* B */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* C */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* D */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* E */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* F */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1
    };

    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    const unsigned char * const SRC_END = pSrc + SRC_LEN;
    const unsigned char * const SRC_LAST_DEC = SRC_END - 2; // last decodable '%'

    char * const pStart = new char[SRC_LEN];
    char * pEnd = pStart;

    while (pSrc < SRC_LAST_DEC)
    {
        if (*pSrc == '%')
        {
            char dec1, dec2;
            if (-1 != (dec1 = HEX2DEC[*(pSrc + 1)])
                && -1 != (dec2 = HEX2DEC[*(pSrc + 2)]))
            {
                *pEnd++ = (dec1 << 4) + dec2;
                pSrc += 3;
                continue;
            }
        }
        *pEnd++ = *pSrc++;
    }

    // the last 2 chars
    while (pSrc < SRC_END)
        *pEnd++ = *pSrc++;

    std::string sResult(pStart, pEnd);
    delete [] pStart;
    return sResult;
}

Of course I call url_decode, but I still get a std::string, so I hope my problem is clearer now.

  • possible duplicate of how to convert UTF-8 std::string to UTF-16 std::wstring Commented Aug 26, 2011 at 22:18
  • @Kabumbus: Is the high byte or the low byte first in your string? If you have 12 34 in your string would you expect to get 1234 or 3412 in your wstring? Commented Aug 26, 2011 at 22:25
  • @Kabumbus: I am now completely confused. I've changed my mind again and think that you really want to convert UTF-8 to UTF-16, rather than just doing some byte shifting. But who knows. If you do need UTF-8 to UTF-16, the link that Nicol Bolas posted above will work. Commented Aug 26, 2011 at 22:45
  • -1 really unclear what's asked. Commented Aug 26, 2011 at 22:54
  • @Kabumbus, can you post an example of what you want using byte and word values instead of using foreign characters, which are just very confusing. Commented Aug 26, 2011 at 22:57

2 Answers


Here is what I am tinkering around with for a solution to your issue:

std::string wrong("Новая папка");
std::wstring correct( (wchar_t*)wrong.data() );

According to http://www.cplusplus.com/reference/string/string/data/ the data() member function should give us the raw char*, and simply casting it to (wchar_t*) should cause it to stick the 00 and 00 together to make 0000, as you describe in your example.

I personally don't like casting like this, but this is all I have come up with so far.

Edit - Which library are you using? Does it come with some other function to reverse what it has done?

If it is popular, surely someone else has had this issue before. How did they solve it?

Edit 2 - Here is a disgusting way, using malloc, some assumptions that there won't be any half code-points in the original string, and another terrible cast. :(

std::string wrong("Новая папка");
wchar_t *lesswrong = (wchar_t*) malloc (wrong.size()/sizeof(wchar_t) + sizeof(wchar_t));
lesswrong = (wchar_t*)wrong.data();
lesswrong[wrong.size()] = '\0';
std::wstring correct( lesswrong );

There is no way this can be correct. Even if it works it is so ugly.

Edit 3 - Like Kerrek said, this is a better way to do it.

std::string wrong("Новая папка");
std::wstring correct( (wchar_t*)wrong.data(), wrong.size()/2 );
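
If you want to spell out the assumptions this relies on, a rough sketch (to_wstring_raw is just a made-up name):

#include <cassert>
#include <string>

// Reinterprets the bytes of 'wrong' as wchar_t units, no re-encoding.
std::wstring to_wstring_raw(const std::string& wrong)
{
    assert(sizeof(wchar_t) == 2);   // holds on MSVC, not on most Unix systems
    assert(wrong.size() % 2 == 0);  // only whole UTF-16 code units
    return std::wstring((const wchar_t*)wrong.data(), wrong.size() / 2);
}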

6 Comments

You might also have to pass a size argument here, because data() doesn't give a (wchar-)null-terminated string.
This won't work if sizeof(wchar_t) is not equal to 2. On Mac OS X, sizeof(wchar_t) is 4 for 64-bit programs.
Will only work if the byte ordering is correct, plus you have the lack of a null terminator that Kerrek mentions.
This assumes that the data was encoded on this machine, so byte order should match, but you are right, on a Unix machine this would stuff 2 UTF-16 code points into one UTF-32 code point. I made another assumption in assuming that the library he is using would use the same size for wchar_t that this code would use (seems like a stretch, I admit).
@Sqeaky : std::wstring correct( (wchar_t*)wrong.data(), wrong.size() ); is definitely not correct. The constructor wants the number of characters, but wrong.size() is giving the number of bytes.

If I understand you correctly, you have a std::string object that contains a UTF-16 encoded string, and you want to convert it to a std::wstring without changing the encoding. If I'm right, then you don't have to do a conversion of encoding, nor of representation, but only of storage.

You also think that the string may have been incorrectly encoded into UTF-8. However, UTF-8 is a variable-length encoding, and the length of your incorrectly interpreted data (22 characters) is exactly twice the length of your original data (Новая папка is 11 characters long). This is why I suspect that this may be just a case of wrong storage and not wrong encoding.
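
To make the storage point concrete, here is a small illustration (assuming a little-endian machine such as x86; the byte values are the UTF-16 code units of the 11-character string written out low byte first):

#include <cassert>
#include <string>

int main()
{
    // "Новая папка" as UTF-16LE bytes: Н=U+041D, о=U+043E, в=U+0432,
    // а=U+0430, я=U+044F, space=U+0020, п=U+043F, к=U+043A.
    const char bytes[] =
    {
        '\x1d','\x04', '\x3e','\x04', '\x32','\x04', '\x30','\x04',
        '\x4f','\x04', '\x20','\x00', '\x3f','\x04', '\x30','\x04',
        '\x3f','\x04', '\x3a','\x04', '\x30','\x04'
    };
    std::string wrong(bytes, sizeof(bytes));
    assert(wrong.size() == 22); // exactly twice the 11 characters
    return 0;
}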

The following code does that:

std::wstring convert_utf16_string_to_wstring(const std::string& input)
{
    assert((input.size() & 1) == 0);
    size_t len = input.size() / 2;
    std::wstring output;
    output.resize(len);
    for (size_t i = 0; i < len; ++i)
    {
        unsigned char chr1 = (unsigned char)input[2 * i];
        unsigned char chr2 = (unsigned char)input[2 * i + 1];
        // Note: this line supposes that you use UTF-16-BE both for
        // the std::string and the std::wstring. You'll have to swap
        // chr1 & chr2 if this is not the case.
        unsigned short val = (chr2 << 8) | (chr1);
        output[i] = (wchar_t)(val);
    }
    return output;
}
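
For instance, in your pipeline it would be used roughly like this (encoded_request_from_parser is only a placeholder for whatever the library hands you; not a complete program):

// Usage sketch: 'raw' holds the UTF-16 bytes after url_decode.
std::string raw = url_decode(encoded_request_from_parser); // placeholder name
std::wstring text = convert_utf16_string_to_wstring(raw);
// 'text' should now hold the original characters, e.g. L"Новая папка"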

If you know that on all the platforms you target sizeof(wchar_t) equals 2 (this is not the case on Mac OS for 64-bit programs, for example, where sizeof(wchar_t) equals 4), then you can use a simple cast:

std::wstring convert_utf16_string_to_wstring(const std::string& input)
{
    assert(sizeof(wchar_t) == 2); // A static assert would be better here
    assert((input.size() & 1) == 0);
    return input.empty() ? std::wstring()
                         : std::wstring((const wchar_t*)&input[0], input.size() / 2);
}

7 Comments

This does not convert from UTF-8 to UTF-16, it merely converts from a byte representation of UTF-16 to a word representation.
well I get ⿯뾍껯뾢ꃯ뾿⃯뾯ꃯ뾯ꫯ뾠 which is much more like real letters but something is wrong..(
@bdonlan: It's not at all clear to me that the OP does want a UTF-8 to UTF-16 conversion. I've changed my mind twice on this.
@bdonlan: The OP does not want a conversion from UTF-16 to UTF-8. He has a std::string that contains data encoded in UTF-16 and he wants that data stored in a std::wstring.
@Kabumbus: Sorry, I inverted chr1 and chr2 on one of the lines. I've edited my post with what I expect is the correct version. BTW, if sizeof(wchar_t) == 2 then you can probably just do a memcpy(&output[0], &input[0], input.size()) instead of the loop.
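
Spelled out, that memcpy variant would look roughly like this (convert_by_memcpy is a made-up name; it only holds when sizeof(wchar_t) == 2 and the byte order already matches):

#include <cassert>
#include <cstring>
#include <string>

// Copies the raw UTF-16 bytes straight into the wstring's buffer.
std::wstring convert_by_memcpy(const std::string& input)
{
    assert(sizeof(wchar_t) == 2);
    assert(input.size() % 2 == 0);
    std::wstring output;
    output.resize(input.size() / 2);
    if (!input.empty())
        std::memcpy(&output[0], &input[0], input.size());
    return output;
}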
