On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But GCC (gcc (Debian 4.7.2-5) 4.7.2) seemingly hasn't implemented this new feature yet. Is there a way to perform such a conversion on Linux without iconv (preferably using the conversion tools of the std library)?
- Is there a reason why you don't want to use iconv? – Dale Wilson, May 28, 2014 at 18:48
- Is there a reason why you don't want to implement it yourself? – peppe, May 28, 2014 at 18:50
- Well, the reason is that if this can be done with std, why reinvent the wheel? – Al Berger, May 28, 2014 at 18:54
- Because your std doesn't implement it :-) – peppe, May 28, 2014 at 18:54
- Is codecvt_utf16 the only way in std to do such a conversion? – Al Berger, May 28, 2014 at 18:56
1 Answer
Decoding UTF-16 into UTF-32 is extremely easy.
You may want to detect at compile time which standard library version you're using, and fall back to your own conversion routine if you detect a broken one (without the functions you need).
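One way to do that detection is a feature test on the <codecvt> header. This is only a sketch, assuming a compiler that understands `__has_include` (standard since C++17, available earlier as an extension in some compilers). The macro name `HAVE_STD_CODECVT` and the constant `use_std_codecvt` are made up for illustration. Compilers without `__has_include` simply fall through to the hand-written routine, even if they do ship the header.

```cpp
// HAVE_STD_CODECVT is a hypothetical macro name used only in this sketch.
#if defined(__has_include)
#  if __has_include(<codecvt>)
#    define HAVE_STD_CODECVT 1
#  endif
#endif
#ifndef HAVE_STD_CODECVT
#  define HAVE_STD_CODECVT 0   // unknown or missing: use the manual routine
#endif

#if HAVE_STD_CODECVT
#  include <codecvt>           // provides std::codecvt_utf16
#endif

// True when the standard facet can be used, false when the hand-written
// conversion routine should be compiled in instead.
constexpr bool use_std_codecvt = (HAVE_STD_CODECVT != 0);
```

Code that needs the conversion can then branch on `#if HAVE_STD_CODECVT` and keep the manual routine as the default path.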
Inputs:
- a pointer to the source UTF-16 data (char16_t *, ushort *, or, for convenience, UTF16 *);
- its size;
- a pointer to the UTF-32 data (char32_t *, uint *, or, for convenience, UTF32 *).
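The inputs listed above could be set up roughly like this. The aliases follow the names used in the answer; `make_output_buffer` is a hypothetical helper added here only to show how large the output needs to be.

```cpp
#include <cstddef>
#include <vector>

// Convenience aliases assumed by this sketch (the answer only names them).
typedef char16_t UTF16;
typedef char32_t UTF32;

// Each UTF-16 code unit yields at most one UTF-32 code point (a surrogate
// pair turns two units into one), so input_size output slots always suffice.
std::vector<UTF32> make_output_buffer(std::size_t input_size) {
    return std::vector<UTF32>(input_size);
}
```

Sizing the output to the input length is a worst-case bound; text with surrogate pairs will use fewer slots.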
The code looks like:

    void convert_utf16_to_utf32(const UTF16 *input,
                                size_t input_size,
                                UTF32 *output)
    {
        const UTF16 * const end = input + input_size;
        while (input < end) {
            const UTF16 uc = *input++;
            if (!is_surrogate(uc)) {
                *output++ = uc;
            } else {
                if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input))
                    *output++ = surrogate_to_utf32(uc, *input++);
                else {
                    // ERROR: unpaired surrogate, handle it here
                }
            }
        }
    }

Error handling is left to you. You might want to insert a U+FFFD¹ into the stream and keep going, or just bail out; it's really up to you. The auxiliary functions are trivial:

    int is_surrogate(UTF16 uc) { return (uc - 0xd800u) < 2048u; }
    int is_high_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xd800; }
    int is_low_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xdc00; }

    UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) {
        return (high << 10) + low - 0x35fdc00;
    }

¹ Cf. Unicode:
- § 3.9 Unicode Encoding Forms (Best Practices for Using U+FFFD)
- § 5.22 Best Practice for U+FFFD Substitution
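As an illustration of the "insert a U+FFFD and keep going" option, here is a sketch of how the error branch could be filled in. The function name `convert_utf16_to_utf32_lossy` and its return value (the number of code points written) are assumptions of this example, not part of the answer. The `static_assert` spells out where the 0x35fdc00 constant comes from.

```cpp
#include <cstddef>

typedef char16_t UTF16;  // assumed aliases, matching the answer's signature
typedef char32_t UTF32;

static bool is_high_surrogate(UTF16 uc) { return (uc & 0xfc00u) == 0xd800u; }
static bool is_low_surrogate(UTF16 uc)  { return (uc & 0xfc00u) == 0xdc00u; }

// 0x35fdc00 == (0xd800 << 10) + 0xdc00 - 0x10000: subtracting it removes both
// surrogate tags and adds the 0x10000 supplementary-plane offset in one step.
static_assert((0xd800u << 10) + 0xdc00u - 0x10000u == 0x35fdc00u, "constant check");

static UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) {
    return (static_cast<UTF32>(high) << 10) + low - 0x35fdc00u;
}

// Hypothetical variant: writes U+FFFD for every unpaired surrogate and
// returns the number of UTF-32 code points written to output.
std::size_t convert_utf16_to_utf32_lossy(const UTF16 *input, std::size_t input_size,
                                         UTF32 *output) {
    const UTF16 *const end = input + input_size;
    UTF32 *const start = output;
    while (input < end) {
        const UTF16 uc = *input++;
        if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input))
            *output++ = surrogate_to_utf32(uc, *input++);   // valid pair
        else if (is_high_surrogate(uc) || is_low_surrogate(uc))
            *output++ = 0xfffdu;                            // lone surrogate
        else
            *output++ = uc;                                 // BMP code point
    }
    return static_cast<std::size_t>(output - start);
}
```

For instance, the pair D83D DE00 decodes to U+1F600, while a stray DC00 or a trailing D800 each become a single U+FFFD.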
² Also consider that the !is_surrogate(uc) branch is by far the most common (as is the non-error path in the second if); you might want to optimize for that with __builtin_expect or similar.
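A sketch of what those branch hints might look like, combined with the "just bail out" error policy. `__builtin_expect` is a GCC/Clang extension, so the `LIKELY` macro (a name chosen for this example) degrades to a no-op elsewhere. The function name `convert_utf16_to_utf32_strict` and the `written` out-parameter are likewise assumptions of this sketch. Whether the hints help in practice is something to measure, not assume.

```cpp
#include <cstddef>

// __builtin_expect is a GCC/Clang extension; fall back to a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#  define LIKELY(x) (x)
#endif

typedef char16_t UTF16;  // assumed aliases, matching the answer's signature
typedef char32_t UTF32;

static bool is_surrogate(UTF16 uc)      { return (uc - 0xd800u) < 2048u; }
static bool is_high_surrogate(UTF16 uc) { return (uc & 0xfc00u) == 0xd800u; }
static bool is_low_surrogate(UTF16 uc)  { return (uc & 0xfc00u) == 0xdc00u; }

// Hypothetical strict variant: stops at the first unpaired surrogate and
// returns false; *written receives the number of code points produced so far.
bool convert_utf16_to_utf32_strict(const UTF16 *input, std::size_t input_size,
                                   UTF32 *output, std::size_t *written) {
    const UTF16 *const end = input + input_size;
    UTF32 *const start = output;
    bool ok = true;
    while (input < end) {
        const UTF16 uc = *input++;
        if (LIKELY(!is_surrogate(uc))) {
            *output++ = uc;                                  // hot path: BMP
        } else if (LIKELY(is_high_surrogate(uc) && input < end &&
                          is_low_surrogate(*input))) {
            *output++ = (static_cast<UTF32>(uc) << 10) + *input++ - 0x35fdc00u;
        } else {
            ok = false;                                      // cold path: bail out
            break;
        }
    }
    *written = static_cast<std::size_t>(output - start);
    return ok;
}
```

On failure the caller still learns how many code points were written before the bad unit, which makes it possible to report an error position.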
8 Comments
is_surrogate. I was thinking about using unsigned arithmetic, not signed.