wcstombs: character encoding?

Question

The wcstombs documentation says it "converts the sequence of wide-character codes to multibyte string", but it never says what a "wide-character" is.

Is it implicit (say, it converts UTF-16 to UTF-8), or is the conversion defined by some environment variable?

Also, what is the typical use case of wcstombs?

A "wide-character" is a wchar_t.

– kennytm, Commented Feb 3, 2010 at 7:08

Michael Burr · Accepted Answer · 2010-02-03 08:18:37Z

You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.

For example, with MSVC you might use

setlocale( LC_ALL, ".1252" );

to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character set:

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.

The "wide-character" wchar_t type is intended to be able to support any character set the system supports; the standard doesn't define the size of wchar_t (it could be as small as a char or as large as any of the larger integer types). On Windows it holds the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before Windows 2000). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should document this, but I can't find it.

Warning: there is no standard for the locale string passed to setlocale, so it is not easy to do anything cross-platform. For instance, .1252 is valid on Windows but not on UNIX/Linux (there you will see names like en_US.UTF-8 or en_US.ISO-8859-1).
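
Because the accepted names differ per platform, one way to cope is to try several candidates in order and keep the first one that setlocale() accepts. The following is only a sketch: the helper name select_ctype_locale is hypothetical, and the candidate list is up to the caller.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name for LC_CTYPE and
 * return the first one that setlocale() accepts, or NULL if none works.
 * Which names are valid is implementation defined. */
const char *select_ctype_locale(const char *const names[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}
```

A caller might pass a list such as { ".1252", "en_US.ISO-8859-1", "C" }. Note that some C libraries accept almost any name, so a non-NULL result does not by itself prove the intended encoding is active.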

caf · Accepted Answer · 2010-02-03 07:18:59Z

It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS2 on Windows, but is usually UCS4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
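
As a rough illustration of the typical use case, here is a minimal sketch of a helper that converts a wide string into a freshly allocated multibyte string in whatever encoding the current LC_CTYPE locale selects. The name wide_to_multibyte is hypothetical, and sizing the buffer from MB_CUR_MAX is an assumption of this sketch, not the only way to do it.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert ws to a newly allocated multibyte string
 * using the current LC_CTYPE locale. Returns NULL on allocation failure
 * or if some wide character has no representation in that locale.
 * The caller frees the result. */
char *wide_to_multibyte(const wchar_t *ws)
{
    /* Each wide character needs at most MB_CUR_MAX bytes in this locale;
     * the extra byte is for the terminating '\0'. */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    size_t n = wcstombs(out, ws, cap);
    if (n == (size_t)-1) {   /* unconvertible character */
        free(out);
        return NULL;
    }
    out[n] = '\0';           /* n is always less than cap here */
    return out;
}
```

Called before any setlocale(), the program is in the "C" locale, where only the basic character set is guaranteed to convert. After selecting a UTF-8 locale, the same call would produce UTF-8 bytes.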

3 Comments

Mihai Nita Over a year ago

On Windows that is UTF-16, not UCS2.

caf Over a year ago

Fair enough. (That seems somewhat broken - the whole point of widechars was supposed to be that one widechar is always exactly one character).

Miral Over a year ago

That's never true. Even a 32-bit widechar on Linux might represent a non-printing element such as part of a decomposed accented character, or an RTL ordering directive, or all sorts of other things. So it's never safe to assume that one code point is one character, no matter the encoding.
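
The points in these comments can be checked with a small sketch that counts wchar_t elements with wcslen(). The array names are illustrative only, and the third result depends on whether the platform's wchar_t is 16 or 32 bits wide.

```c
#include <stdio.h>
#include <wchar.h>

/* U+00E9: "e with acute" as a single precomposed code point. */
const wchar_t precomposed_e[] = L"\u00E9";
/* The same visible character as 'e' followed by U+0301 COMBINING ACUTE ACCENT. */
const wchar_t decomposed_e[] = L"e\u0301";
/* U+1F600 lies outside the Basic Multilingual Plane. */
const wchar_t grinning_face[] = L"\U0001F600";

/* Print how many wchar_t elements each string occupies. */
void print_element_counts(void)
{
    printf("precomposed:   %zu\n", wcslen(precomposed_e));
    printf("decomposed:    %zu\n", wcslen(decomposed_e));
    printf("grinning face: %zu\n", wcslen(grinning_face));
}
```

The decomposed form takes two elements everywhere even though a reader sees one character. The last string takes one element where wchar_t is 32 bits and two (a surrogate pair) where it is 16 bits, as on Windows.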

Alok Singhal · Accepted Answer · 2010-02-03 07:18:24Z

According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the only limits it places on WCHAR_MIN and WCHAR_MAX are minimum ranges: at least [0, 255] if wchar_t is unsigned, or at least [-127, 127] if it is signed.

A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, each character need not be of equal number of bytes (UTF-8 is an example). Whereas, an object of type wchar_t has a fixed size (in a given implementation, of course).
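
To make the variable-width point concrete, here is a sketch that counts the characters (not bytes) in a multibyte string by stepping through it with mbrtowc(). The helper name mb_char_count is hypothetical, and the result depends on the LC_CTYPE locale in effect.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in a multibyte string under
 * the current LC_CTYPE locale. Returns (size_t)-1 if the string contains
 * an invalid or truncated multibyte sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;
    while (remaining > 0) {
        /* Passing NULL for the output: only the byte length is needed. */
        size_t len = mbrtowc(NULL, s, remaining, &state);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)                  /* defensive: embedded null character */
            break;
        s += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

For plain ASCII text the count equals strlen(). Under a UTF-8 locale, a character such as "é" occupies two bytes but counts once.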

As an aside, I can also find the following in my copy of the C99 draft:

__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.

So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
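
A program can test for that macro at compile time. A minimal sketch (the function name is hypothetical):

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation defines
 * __STDC_ISO_10646__, i.e. promises that wchar_t values are
 * ISO/IEC 10646 (Unicode) code points, and 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

When the function returns 0, nothing is promised about the wchar_t encoding either way; the implementation simply makes no guarantee through this macro.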

The actual limit on WCHAR_MAX is not 255 (you are probably confusing it with the char type). According to C11 (C99 has the same wording), the value of WCHAR_MAX shall be no less than 255. The real value may be 2147483647. I have never seen it be 255.
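
The actual values on a given implementation are easy to inspect. This sketch prints them; the function names are illustrative, and the output differs between platforms.

```c
#include <limits.h>
#include <stdio.h>
#include <wchar.h>   /* WCHAR_MIN, WCHAR_MAX */

/* Number of bits in one wchar_t object on this implementation. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Print the size and range of wchar_t. The casts assume long long is
 * wide enough to hold both limits, which holds on common platforms. */
void print_wchar_limits(void)
{
    printf("sizeof(wchar_t) = %zu bytes (%zu bits)\n",
           sizeof(wchar_t), wchar_bits());
    printf("WCHAR_MIN = %lld\n", (long long)WCHAR_MIN);
    printf("WCHAR_MAX = %lld\n", (long long)WCHAR_MAX);
}
```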

Sam Post · Accepted Answer · 2010-02-03 07:22:40Z

Wide character strings are composed of wchar_t elements, each typically wider than a byte, whereas the normal C string is a char*, a sequence of byte-wide characters. wchar_t is not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.

I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of unicode.

Typical usage would be converting a 2-byte based string to a regular C string, and vice versa.

This is perhaps a bit confusing. In this and similar usages, a "multi-byte string" is a string made of chars (a "standard ANSI C string") in which one logical character may take more than one char (byte). A wide string, by contrast, typically allots more than one byte per element (sizeof(wchar_t)==2 is common), a choice often made initially in the mistaken belief that the number of logical characters in a string would then equal the number of elements.
