30

I need to convert between wstring and string. I figured out that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.

My idea is that when I read a UTF-8 encoded file into chars, one utf-8 character is read into two normal characters (which is how utf-8 works). I'd like to create this UTF-8 string from a wstring representation for a library I use in my code.

Does anybody know how to do it?

I already tried this:

    locale mylocale("cs_CZ.utf-8");
    mbstate_t mystate;
    wstring mywstring = L"čřžýáí";
    const codecvt<wchar_t,char,mbstate_t>& myfacet =
        use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);
    codecvt<wchar_t,char,mbstate_t>::result myresult;
    size_t length = mywstring.length();
    char* pstr = new char[length+1];
    const wchar_t* pwc;
    char* pc;
    // translate characters:
    myresult = myfacet.out(mystate,
                           mywstring.c_str(), mywstring.c_str()+length+1, pwc,
                           pstr, pstr+length+1, pc);
    if (myresult == codecvt<wchar_t,char,mbstate_t>::ok)
        cout << "Translation successful: " << pstr << endl;
    else
        cout << "failed" << endl;
    return 0;

which returns 'failed' for cs_CZ.utf-8 locale and works correctly for cs_CZ.iso8859-2 locale.

3
  • 1
    take a look at this link: boost.org/doc/libs/1_42_0/libs/serialization/doc/codecvt.html might be of some help Commented Dec 5, 2010 at 13:14
  • 3
    "one utf-8 character is read into two normal characters (which is how utf-8 works)" No it's not. UTF-16 (mostly) works this way, but a UTF-8 codepoint is represented by one to 4 bytes, and a "character" can consist of multiple codepoints. Commented Dec 5, 2010 at 14:32
  • ephimient - yes - I know it, I just wrote it badly :) Commented Dec 5, 2010 at 18:06

8 Answers

99

The code below might help you :)

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>

    // convert UTF-8 string to wstring
    std::wstring utf8_to_wstring (const std::string& str)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
        return myconv.from_bytes(str);
    }

    // convert wstring to UTF-8 string
    std::string wstring_to_utf8 (const std::wstring& str)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
        return myconv.to_bytes(str);
    }

8 Comments

But not on linux using libstdc++.
While the above works, I strongly suggest looking into a Unicode library such as ICU or Boost.Locale.
It works like a charm for any std::wstring. Small test here: stackoverflow.com/a/37531136/1802974
codecvt is deprecated as of C++17 and there is no replacement.
@AlexReinking cpp reference doesn't say that codecvt is deprecated. While some members are deprecated, there are new ones that are added (eg. C++20 adds std::codecvt<char32_t, char8_t, std::mbstate_t>). en.cppreference.com/w/cpp/locale/codecvt
10

What's your platform? Note that Windows does not support UTF-8 locales so this may explain why you're failing.

To get this done in a platform-dependent way you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't comment on that option.

Comments

3

On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, like 😉 (U+1F609).

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>

    // convert UTF-8 string to wstring
    std::wstring utf8_to_wstring (const std::string& str)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
        return myconv.from_bytes(str);
    }

    // convert wstring to UTF-8 string
    std::string wstring_to_utf8 (const std::wstring& str)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
        return myconv.to_bytes(str);
    }

Comments

3

The currently most upvoted answer is not platform-independent: it breaks on non-BMP characters (e.g. emoji like 🚒). JWiesemann already pointed this out in their answer, but their code will only work on Windows.

So here's a correct platform-independent version:

    #include <codecvt>
    #include <locale>   // for std::wstring_convert
    #include <string>
    #include <type_traits>

    std::string wstring_to_utf8(std::wstring const& str)
    {
        std::wstring_convert<std::conditional_t<
            sizeof(wchar_t) == 4,
            std::codecvt_utf8<wchar_t>,
            std::codecvt_utf8_utf16<wchar_t>>> converter;
        return converter.to_bytes(str);
    }

    std::wstring utf8_to_wstring(std::string const& str)
    {
        std::wstring_convert<std::conditional_t<
            sizeof(wchar_t) == 4,
            std::codecvt_utf8<wchar_t>,
            std::codecvt_utf8_utf16<wchar_t>>> converter;
        return converter.from_bytes(str);
    }

On MSVC this might generate some deprecation warnings. You can disable these by wrapping the functions in

    #pragma warning(push)
    #pragma warning(disable : 4996)
    <the two functions>
    #pragma warning(pop)

See this answer to another question as to why it's ok to disable that warning.

Comments

2

You can use boost's utf_to_utf converter to get char format to store in std::string.

    #include <boost/locale/encoding_utf.hpp>

    std::string myresult = boost::locale::conv::utf_to_utf<char>(my_wstring);

Comments

-1

What a locale does is give the program information about the external encoding; it assumes the internal encoding is unchanged. If you want to output UTF-8, you need to do it from wchar_t, not from char*.

What you could do is output it as raw data (not as a string); it should then be correctly interpreted if the system's locale is UTF-8.

Plus when using (w)cout/(w)cerr/(w)cin you need to imbue the locale on the stream.

1 Comment

UTF-8 uses 8-bit code units. char (as well as signed char and unsigned char) must be a minimum of 8 bits. I believe you may be thinking of UTF-16, UTF-32, UCS2, or UCS4.
-2

The Lexertl library has an iterator that lets you do this:

    std::string str;
    str.assign(
        lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
        lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));

Comments

-9

C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.

4 Comments

-1 not really true, C++ supports locales which includes encoding (unfortunately this is broken for UTF-8 on Windows)
Agree. C++ doesn't guarantee Unicode, or the existence of locale("cs_CZ.utf-8"). But if you've got a system with that locale, it had better work.
No longer true as of C++11. char16_t is specifically intended for UTF-16, and char32_t is specifically intended for UTF-32; C++14 expands on this, by requiring that the char types be large enough to store 256 distinct values specifically to be suitable for UTF-8. C++11 also added classes codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, as well as two new specialisations of codecvt (std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t>). So, C++ now officially supports UTF-8, UTF-16, UTF-32, UCS2, and UCS4.
Out of those codecvts: codecvt_utf8 converts between UTF-8 and UCS2/UCS4, codecvt_utf16 converts between UTF-16 and UCS2/UCS4, codecvt_utf8_utf16 converts between UTF-8 and UTF-16, codecvt's char16_t specialisation is also for UTF-8 and UTF-16, and codecvt's char32_t specialisation converts between UTF-8 and UTF-32. Not 100% sure of exactly how they work yet, I actually just started learning Unicode conversion today.
