How to convert (not necessarily programmatically) between Windows' wchar_t and GCC/Linux one?

Question

Suppose I have these Windows wchar_t strings:

L"\x4f60\x597d"

and

L"\x00e4\x00a0\x597d"

and would like to convert them (not necessarily programmatically; it will be a one-time thing) to the GCC/Linux wchar_t format, which is UTF-32 AFAIK. How do I do it? (A general explanation would be nice, but an example based on this concrete case would be helpful as well.)

Please don't direct me to character conversion sites. I would like to convert from L"\x(something)" form and not "end character" form.

Good question. I just can't figure out how to do it correctly, except by using some script that will modify the source code... +1
– paercebal, Commented Oct 25, 2008 at 10:21
A script is not necessary. I can do it by hand; I just don't know what the correct equivalent would be.
– Paweł Hajdan, Commented Oct 25, 2008 at 10:24
Can't you use -fwide-exec-charset=utf-16 to make GCC use the same charset as Visual C++?
– Johannes Schaub - litb, Commented Dec 8, 2008 at 23:21
litb: that doesn't solve the problem when interfacing with other libraries compiled without this option.
– Paweł Hajdan, Commented Dec 30, 2008 at 12:33

Head Geek · Accepted Answer · 2008-10-25 15:28:19Z

Would converting from UTF-16 (the Visual C++ wchar_t form) to UTF-8, then possibly from UTF-8 to UCS-4 (the GCC wchar_t form), be an acceptable answer?

If so, then on Windows you could use the WideCharToMultiByte function (with CP_UTF8 for the CodePage parameter) for the first part of the conversion. Then you could either paste the resulting UTF-8 strings directly into your program, or convert them further. Here is a message showing how one person did it; you can also write your own code or do it manually (the official spec, with a section on exactly how to convert UTF-8 to UCS-4, can be found here). There may be an easier way; I'm not overly familiar with the conversion facilities on Linux yet.
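For the second step (UTF-8 to UCS-4), a minimal sketch in C++ might look like the following. It assumes well-formed UTF-8 and does no validation; the name first_code_point is made up for illustration and is not part of any library.

```cpp
// Decode the first UTF-8 sequence of a NUL-terminated string into one
// UCS-4 code point. Assumes well-formed input; no validation is attempted.
// Hypothetical helper for illustration only (needs C++14 or later).
constexpr char32_t first_code_point(const char* s) {
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) return b0;  // 1 byte: 0xxxxxxx, value is the byte itself

    // The lead byte's high bits give the number of continuation bytes:
    // 110xxxxx -> 1, 1110xxxx -> 2, 11110xxx -> 3.
    const int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (masks 0x1F, 0x0F, 0x07).
    char32_t cp = b0 & (0x3F >> extra);

    // Each continuation byte (10xxxxxx) contributes six more bits.
    for (int i = 1; i <= extra; ++i) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    return cp;
}
```

With this sketch, the UTF-8 bytes E4 BD A0 decode to 0x4F60 and E5 A5 BD decode to 0x597D, which are the two code points of the first string in the question.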

Ignacio Vazquez-Abrams · Accepted Answer · 2008-10-26 06:43:04Z

You only need to worry about code units between \xD800 and \xDFFF inclusive (the surrogates, which UTF-16 uses in pairs to encode characters above \xFFFF). Every other character should map to exactly the same value from UTF-16 to UCS-4 when zero-filled to 32 bits.
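As a rough sketch of that rule in C++, assuming the input is UTF-16 held in a std::u16string: code units outside the surrogate range are copied and zero-extended, and a high surrogate followed by a low surrogate is merged into one code point. The names combine_surrogates and utf16_to_utf32 are hypothetical, chosen for illustration only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Merge a high surrogate (0xD800..0xDBFF) and a low surrogate
// (0xDC00..0xDFFF) into the code point they encode.
// Hypothetical helper, not a library function.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}

// Convert UTF-16 code units to UTF-32 code points.
// Throws std::invalid_argument on an unpaired surrogate.
inline std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            out.push_back(combine_surrogates(u, in[i + 1]));
            ++i;  // the low surrogate has been consumed as well
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        } else {
            out.push_back(u);  // BMP code unit: same value, zero-extended
        }
    }
    return out;
}
```

For the strings in the question, every code unit (0x4F60, 0x597D, 0x00E4, 0x00A0) lies outside the surrogate range, so by this rule the same \x escapes should carry over unchanged.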

lothar · Accepted Answer · 2008-12-08 23:10:18Z

One of the most widely used libraries for character conversion is ICU (http://icu-project.org/). It is used, for example, by some Boost (http://www.boost.org/) libraries.

Mihai Nita · Accepted Answer · 2009-07-24 08:53:04Z

Ignacio is right: if you don't use some rare Chinese characters (or some extinct scripts), then the mapping is one to one. (The official "lingo" is "if you don't have characters outside the BMP".)

This is the algorithm, just in case (http://unicode.org/faq/utf_bom.html#utf16-3), but again, it is most likely unnecessary for your real case.

You can also use the free sources from Unicode (ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF)
