30

I need to convert between wstring and string. I figured out that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.

My idea is that when I read a UTF-8 encoded file into chars, one utf-8 character is read into two normal characters (which is how utf-8 works). I'd like to create this UTF-8 string from a wstring representation for a library I use in my code.

Does anybody know how to do it?

I already tried this:

    locale mylocale("cs_CZ.utf-8");
    mbstate_t mystate;
    wstring mywstring = L"čřžýáí";
    const codecvt<wchar_t,char,mbstate_t>& myfacet =
        use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);
    codecvt<wchar_t,char,mbstate_t>::result myresult;
    size_t length = mywstring.length();
    char* pstr = new char[length+1];
    const wchar_t* pwc;
    char* pc;
    // translate characters:
    myresult = myfacet.out(mystate,
                           mywstring.c_str(), mywstring.c_str()+length+1, pwc,
                           pstr, pstr+length+1, pc);
    if (myresult == codecvt<wchar_t,char,mbstate_t>::ok)
        cout << "Translation successful: " << pstr << endl;
    else
        cout << "failed" << endl;
    return 0;

which returns 'failed' for cs_CZ.utf-8 locale and works correctly for cs_CZ.iso8859-2 locale.

3
  • 1
    take a look at this link: boost.org/doc/libs/1_42_0/libs/serialization/doc/codecvt.html might be of some help Commented Dec 5, 2010 at 13:14
  • 3
    "one utf-8 character is read into two normal characters (which is how utf-8 works)" No it's not. UTF-16 (mostly) works this way, but a UTF-8 codepoint is represented by one to 4 bytes, and a "character" can consist of multiple codepoints. Commented Dec 5, 2010 at 14:32
  • ephimient - yes - I know it, I just wrote it badly :) Commented Dec 5, 2010 at 18:06

8 Answers

99

The code below might help you :)

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>

    // convert UTF-8 string to wstring
    std::wstring utf8_to_wstring (const std::string& str)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
        return myconv.from_bytes(str);
    }

    // convert wstring to UTF-8 string
    std::string wstring_to_utf8 (const std::wstring& str)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
        return myconv.to_bytes(str);
    }

8 Comments

But not on linux using libstdc++.
While the above works, I strongly suggest looking into a Unicode library such as ICU or Boost.Locale.
It works like a charm for any std::wstring. Small test here: stackoverflow.com/a/37531136/1802974
codecvt is deprecated as of C++17 and there is no replacement.
@AlexReinking cpp reference doesn't say that codecvt is deprecated. While some members are deprecated, there are new ones that are added (eg. C++20 adds std::codecvt<char32_t, char8_t, std::mbstate_t>). en.cppreference.com/w/cpp/locale/codecvt
10

What's your platform? Note that Windows does not support UTF-8 locales so this may explain why you're failing.

To get this done in a platform-dependent way you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't comment on that option.

Comments

3

On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, like 😉 (U+1F609).

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>

    // convert UTF-8 string to wstring
    std::wstring utf8_to_wstring (const std::string& str)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
        return myconv.from_bytes(str);
    }

    // convert wstring to UTF-8 string
    std::string wstring_to_utf8 (const std::wstring& str)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
        return myconv.to_bytes(str);
    }

Comments

3

The currently most upvoted answer is not platform-independent: it breaks on non-BMP characters (e.g. emoji like 🚒). JWiesemann already pointed this out in their answer, but their code will only work on Windows.

So here's a correct platform-independent version:

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>
    #include <type_traits>

    std::string wstring_to_utf8(std::wstring const& str)
    {
        std::wstring_convert<std::conditional_t<
            sizeof(wchar_t) == 4,
            std::codecvt_utf8<wchar_t>,
            std::codecvt_utf8_utf16<wchar_t>>> converter;
        return converter.to_bytes(str);
    }

    std::wstring utf8_to_wstring(std::string const& str)
    {
        std::wstring_convert<std::conditional_t<
            sizeof(wchar_t) == 4,
            std::codecvt_utf8<wchar_t>,
            std::codecvt_utf8_utf16<wchar_t>>> converter;
        return converter.from_bytes(str);
    }

On MSVC this might generate some deprecation warnings. You can disable these by wrapping the functions in

    #pragma warning(push)
    #pragma warning(disable : 4996)
    <the two functions>
    #pragma warning(pop)

See this answer to another question as to why it's ok to disable that warning.

Comments

2

You can use boost's utf_to_utf converter to get char format to store in std::string.

    #include <boost/locale/encoding_utf.hpp>

    std::string myresult = boost::locale::conv::utf_to_utf<char>(my_wstring);

Comments

-1

What a locale does is give the program information about the external encoding; it assumes the internal encoding is unchanged. If you want to output UTF-8, you need to do it from wchar_t, not from char*.

What you could do is output it as raw data (not as a string); it should then be correctly interpreted if the system's locale is UTF-8.

Plus when using (w)cout/(w)cerr/(w)cin you need to imbue the locale on the stream.

1 Comment

UTF-8 uses 8-bit code units. char (as well as signed char and unsigned char) must be a minimum of 8 bits. I believe you may be thinking of UTF-16, UTF-32, UCS2, or UCS4.
-2

The Lexertl library has an iterator that lets you do this:

    std::string str;
    str.assign(
        lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
        lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));

Comments

-9

C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.

4 Comments

-1 not really true, C++ supports locales which includes encoding (unfortunately this is broken for UTF-8 on Windows)
Agree. C++ doesn't guarantee Unicode, or the existence of locale("cs_CZ.utf-8"). But if you've got a system with that locale, it had better work.
No longer true as of C++11. char16_t is specifically intended for UTF-16, and char32_t is specifically intended for UTF-32; C++14 expands on this, by requiring that the char types be large enough to store 256 distinct values specifically to be suitable for UTF-8. C++11 also added classes codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, as well as two new specialisations of codecvt (std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t>). So, C++ now officially supports UTF-8, UTF-16, UTF-32, UCS2, and UCS4.
Out of those codecvts: codecvt_utf8 converts between UTF-8 and UCS2/UCS4, codecvt_utf16 converts between UTF-16 and UCS2/UCS4, codecvt_utf8_utf16 converts between UTF-8 and UTF-16, codecvt's char16_t specialisation is also for UTF-8 and UTF-16, and codecvt's char32_t specialisation converts between UTF-8 and UTF-32. Not 100% sure of exactly how they work yet, I actually just started learning Unicode conversion today.
