6

I need C++ code to convert a string given as wchar_t* to a UTF-16 string. It must work on both Windows and Linux. I've looked through a lot of web pages, but the subject is still not clear to me.

As I understand it, I need to:

  1. Call setlocale with LC_CTYPE and UTF-16 encoding.
  2. Use wcstombs to convert wchar_t to UTF-16 string.
  3. Call setlocale to restore previous locale.

Is there a way to convert wchar_t* to UTF-16 portably (on both Windows and Linux)?

2
  • Maybe my encoding-related questions #1, #2, #3 are of some use. Commented Mar 14, 2012 at 6:56
  • 2
    Which code set is the wchar_t string in? What type do you expect to use to represent the character type in the UTF-16 string? Is this simply a transform between UTF-32 (in the wchar_t) and UTF-16 in uint16_t? Or are you dealing with codeset conversion too? Portability is a noble goal; it is not always achievable, sadly. Do investigate ICU. Commented Mar 14, 2012 at 6:56

5 Answers

8

There is no single cross-platform method for doing this in C++03 without a library. This is partly because wchar_t itself is not the same thing across platforms. On Windows, wchar_t is a 16-bit value, while on other platforms it is often a 32-bit value. So you would need two different code paths to do it.



5

C++11's std::codecvt_utf16 should work, I think.

std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).

See this: http://en.cppreference.com/w/cpp/locale/codecvt_utf16
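For illustration, here is a minimal sketch of what using this facet might look like, pairing it with std::wstring_convert. The helper name wide_to_utf16be_bytes is made up for this example. Two caveats: <codecvt> and std::wstring_convert have been deprecated since C++17, so compilers may warn, and when wchar_t is 16 bits wide the facet treats it as UCS-2, so characters outside the BMP may not be handled there. With the default template arguments the output is big-endian and has no BOM.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (not from the answer): converts a wide string to
// UTF-16 returned as raw bytes, big-endian, without a byte order mark.
// Assumes <codecvt> is available; it is deprecated since C++17.
inline std::string wide_to_utf16be_bytes(const std::wstring& in) {
    // The default mode of codecvt_utf16 is big-endian with no header (BOM).
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv;
    // to_bytes throws std::range_error if the conversion fails.
    return conv.to_bytes(in);
}
```

Calling wide_to_utf16be_bytes(L"A") should give the two bytes 0x00 0x41. If little-endian output is needed, std::little_endian can be passed as the third template argument of std::codecvt_utf16.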

2 Comments

All well and good, except G++ (or, more precisely, libstdc++) doesn't provide the <codecvt> header yet, so std::codecvt_utf16 is not available.
C++11 also introduces char16_t and char32_t types (and associated std::basic_string typedefs) to get away from wchar_t platform issues. For instance, use std::u16string wherever you need a UTF-16 encoded string.
3

You can assume that wchar_t is UTF-32 in the non-Windows world. This is true on Linux, Mac OS X, and most *nix systems. There are very few exceptions, and they are on systems you will probably never touch :-)

On Windows, wchar_t is UTF-16, so there the conversion function can just do a memcpy :-)

On everything else, the conversion is algorithmic and pretty simple, so there is no need for fancy support from third-party libraries.

Here is the basic algorithm: http://unicode.org/faq/utf_bom.html#utf16-3

And you can probably find a dozen different implementations if you don't want to write your own :-)
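To make this concrete, here is a rough sketch of such a conversion. The names append_utf16 and wide_to_utf16 are made up for this example, and it uses C++11's std::u16string as the output type. It assumes wchar_t holds UTF-16 code units when it is 2 bytes wide and UTF-32 code points otherwise. Replacing invalid input (lone surrogates, values above 0x10FFFF) with U+FFFD is my own choice here, not something the algorithm requires.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 and append it.
// Invalid values are replaced with U+FFFD (an assumed policy).
inline void append_utf16(std::u16string& out, std::uint32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        out.push_back(static_cast<char16_t>(0xFFFD));
    } else if (cp < 0x10000) {
        // BMP code point: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: split into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: convert a null-terminated wide string to UTF-16.
// Assumes 2-byte wchar_t is already UTF-16 and wider wchar_t is UTF-32.
inline std::u16string wide_to_utf16(const wchar_t* s) {
    std::u16string out;
    if (s == nullptr) return out;
    for (; *s != L'\0'; ++s) {
        if (sizeof(wchar_t) == 2) {
            // Windows-style: copy the code unit through unchanged.
            out.push_back(static_cast<char16_t>(*s));
        } else {
            // Linux-style: each wchar_t is a whole code point.
            append_utf16(out, static_cast<std::uint32_t>(*s));
        }
    }
    return out;
}
```

For example, wide_to_utf16(L"\U0001F600") should yield the two code units 0xD83D 0xDE00 on either kind of platform, since 0x1F600 - 0x10000 = 0xF600, whose high ten bits are 0x3D and low ten bits are 0x200.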


2

The problem is that wchar_t is rather underspecified. You could use GNU libiconv to do what you want. It accepts the special encoding name "wchar_t" as both source and target encoding. That way the code will be portable to Windows, Linux, and anywhere else you can provide libiconv.
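A rough sketch of how that could look with the POSIX-style iconv interface follows. The helper name wide_to_utf16_iconv is made up for this example, and I am assuming the implementation accepts "WCHAR_T" and "UTF-16LE" as encoding names (glibc and GNU libiconv do). Some older libiconv builds declare the input pointer as const char**, so the call may need adjusting there, and outside glibc you may have to link with -liconv.

```cpp
#include <iconv.h>

#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a null-terminated wide string to UTF-16
// via iconv. Assumes the encoding names "WCHAR_T" and "UTF-16LE" exist.
inline std::u16string wide_to_utf16_iconv(const wchar_t* s) {
    iconv_t cd = iconv_open("UTF-16LE", "WCHAR_T");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    const std::size_t n = std::wcslen(s);
    std::size_t in_left = n * sizeof(wchar_t);
    // Each wchar_t needs at most two UTF-16 code units (4 bytes).
    std::vector<char> buf(n * 4 + 4);
    char* in = reinterpret_cast<char*>(const_cast<wchar_t*>(s));
    char* out = buf.data();
    std::size_t out_left = buf.size();

    const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    // Reassemble little-endian byte pairs into code units, so the result
    // does not depend on the byte order of the host.
    const std::size_t bytes = buf.size() - out_left;
    std::u16string result;
    for (std::size_t i = 0; i + 1 < bytes; i += 2) {
        const unsigned lo = static_cast<unsigned char>(buf[i]);
        const unsigned hi = static_cast<unsigned char>(buf[i + 1]);
        result.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return result;
}
```

Requesting "UTF-16LE" rather than plain "UTF-16" is deliberate: with the unqualified name, some implementations prepend a byte order mark to the output.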


-1

The g++ compiler appears to support wcstombs?

1 Comment

Are you asking a question or stating a fact?
