6

I know this question has been asked quite a few times here, and I did read some of the answers, but several solutions have been suggested and I'm trying to figure out which is best.

I'm writing a C99 app that basically receives XML text encoded in UTF-8.

Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).

As I would rather not use an outside, non-standard library right now, I'm trying to implement it using wchar_t.

Currently, I'm using mbstowcs to convert the text to wchar_t for easy manipulation, and for the input I tried in several languages it worked fine.

However, I have read that some people had issues with UTF-8 and mbstowcs, so I would like to know whether this use is acceptable.

The other option I came across was using iconv with the WCHAR_T parameter. However, I'm working on a platform (not a PC) whose locale support is limited to the ANSI C locale. How about that?

I also came across a very popular C++ library, but I'm limited to a C99 implementation.

Also, I will be compiling this code on another platform where the size of wchar_t is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size char containers? But then, which manipulation functions should I use instead?

Happy to hear some thoughts. Thanks.

5
  • You will run into difficulties and have problems, I guarantee it. UTF-8 is an encoding, wchar_t is a storage detail, the two are unrelated. wchar_t just makes handling UTF-16 slightly easier, but what about surrogate-pairs? Multi-byte single-characters in UTF-8? Commented Jan 14, 2014 at 19:56
  • What is wrong with mbstowcs? Commented Jan 14, 2014 at 20:48
  • @Johnnyguitar the answers posted below provide a better explanation of my point. Commented Jan 14, 2014 at 21:05
  • @Dai: "Surrogate pairs" don't exist in UTF-8. They are an encoding detail of UTF-16. Commented Jan 14, 2014 at 22:19
  • @R.. you're right, sorry; my mistake. Commented Jan 14, 2014 at 22:57

2 Answers

5

C does not define what encoding the char and wchar_t types use, and the standard library only mandates some functions that translate between the two without saying how. mbstowcs interprets its input according to the multibyte encoding of the current locale (LC_CTYPE), so if that encoding is not UTF-8, the conversion will fail or corrupt the data.

As noted in the rationale for the C99 standard:

However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.

...

C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.

Sourced from here.

So, if you have UTF-8 data in your chars there isn't a standard API way to convert that to wchar_ts.
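To make the locale dependence concrete, here is a minimal sketch. The helper name convert_in_locale is hypothetical, and any locale name other than "C" (for example "en_US.UTF-8") is an assumption about what the target system provides; only the "C" locale is guaranteed to exist.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters under the named
 * locale. Returns the number of wide characters written, or (size_t)-1
 * if the locale is unavailable or src is not valid in that locale's
 * multibyte encoding.
 * Note: this changes the process-wide LC_CTYPE setting as a side effect. */
size_t convert_in_locale(const char *locale_name, const char *src,
                         wchar_t *dst, size_t dst_len)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;          /* locale not provided by this platform */
    return mbstowcs(dst, src, dst_len);
}
```

On a platform that only offers the "C" locale, asking for a UTF-8 locale makes setlocale return NULL, so the helper reports failure before mbstowcs is ever called. ASCII input converts fine under "C", but what happens to non-ASCII UTF-8 bytes there depends on the C library.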

In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs for example. I am not convinced it will simplify string manipulation. wchar_t is always UTF-16LE on Windows so you may still need to have more than one wchar_t to represent a single Unicode code point anyway.

I suggest you investigate the ICU project - at least from an educational standpoint.


6 Comments

Thanks a lot! I dug around for some info on ICU but couldn't find any useful examples. Should I use ICU just for converting the string, or does it have functions for string manipulation as well?
I suggest you start with the ICU API to see if it meets your needs.
As I understand it, in order to use ICU's string manipulation functions (as explained here link) on a UTF-8 string, I would have to convert my string to UTF-16. The question is: if my string includes characters that take 3-4 bytes in UTF-8, how are they "translated" to UTF-16, which uses 1-2 bytes?
Looks like you'll have to convert your UTF-8 encoded data "manually" to UTF-16. You can do this, sure. You'll have to scan your UTF-8 byte stream for single bytes as well as 2-, 3- and 4-byte sequences. I hope you know how to decode the code points. Any decoded code point that is a surrogate is illegal; drop it. For all code points up to and including 0xFFFF you can just store the value in your wchar (which should be 16 bits wide). For code points above 0xFFFF you must create a surrogate pair. If your wchar is 32 bits wide, just transcode from UTF-8 to UTF-32.
By the way, UTF-16 doesn't use 1-2 bytes; it uses 16-bit words. A surrogate pair is really a dword inside the word stream; it encodes a code point higher than 0xFFFF. In a surrogate pair, the high surrogate must come first, then the low surrogate. The reverse order is illegal, and surrogates that don't appear as a pair are orphaned surrogates.
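The manual conversion described in these comments could be sketched roughly as follows. The function names utf8_decode and utf16_encode are hypothetical, and this is only one way to do it: it rejects malformed input instead of substituting a replacement character, and it uses fixed-width integer types so the result does not depend on sizeof(wchar_t).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min) return 0;                     /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogates are illegal in UTF-8 */
    if (c > 0x10FFFF) return 0;                /* beyond the Unicode range */
    *cp = c;
    return len;
}

/* Encode one code point as UTF-16. Returns the number of 16-bit units
 * written to out (1 or 2), or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
    if (cp <= 0xFFFF) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate first */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* then low surrogate */
    return 2;
}
```

A caller would loop over the input, calling utf8_decode at the current position and advancing by its return value, then either store the code point directly (UTF-32) or pass it through utf16_encode.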
1

Also, I will be compiling this code on another platform where the size of wchar_t is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size char containers?

You could do that with conditional typedefs like this:

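    #include <stddef.h>   /* wchar_t */
    #include <stdint.h>   /* uint16_t, uint32_t */

    /* 16-bit code units */
    #if defined(__STDC_UTF_16__)
    #include <uchar.h>    /* char16_t, char32_t */
    typedef char16_t CHAR16;
    #elif defined(_WIN32)
    typedef wchar_t CHAR16;
    #else
    typedef uint16_t CHAR16;
    #endif

    /* 32-bit code units */
    #if defined(__STDC_UTF_32__)
    #include <uchar.h>
    typedef char32_t CHAR32;
    #elif defined(__STDC_ISO_10646__)
    typedef wchar_t CHAR32;
    #else
    typedef uint32_t CHAR32;
    #endif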

This will define the typedefs CHAR16 and CHAR32 to use the character types introduced in C11 and C++11 if available, but otherwise fall back to using wchar_t when possible and fixed-width unsigned integers otherwise.
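As for which manipulation functions to use with such fixed-width types: the wcs* functions only work on wchar_t, so you would probably end up writing small helpers of your own. A minimal sketch, assuming CHAR32 resolved to the uint32_t fallback and strings are zero-terminated; u32_len and u32_find are hypothetical names, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t CHAR32;   /* assumption: the uint32_t fallback was selected */

/* Number of code points before the terminating 0 (like strlen). */
size_t u32_len(const CHAR32 *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* First occurrence of needle in hay, or NULL if absent (like strstr). */
const CHAR32 *u32_find(const CHAR32 *hay, const CHAR32 *needle)
{
    size_t i;

    if (needle[0] == 0)
        return hay;                 /* empty needle matches at the start */
    for (; *hay != 0; hay++) {
        /* compare until the needle ends or the two strings differ */
        for (i = 0; needle[i] != 0 && hay[i] == needle[i]; i++)
            ;
        if (needle[i] == 0)
            return hay;
    }
    return NULL;
}
```

Because every element is one whole code point, substring search and length become plain array operations with no surrogate handling. Note that a code point is still not always what a user perceives as one character (combining marks, for instance).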
