Some languages, for example Chinese, can be written with more than one character set, and both are correct: Chinese has a newer, simplified character set and also a more complex, traditional one. Similarly, Mongolian or Serbian can be written with either the Cyrillic or the Latin script.
Furthermore, accented characters like ä, ö and similar, which many languages use, have different codes in different encodings, although they look the same on the screen. Thus, writing an ä into a file means writing a different byte sequence depending on whether the file is encoded in UTF-8 or in some other standard (in my example, ISO 8859-1).
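You can make this visible on a UTF-8 terminal with iconv and od (a sketch; the UTF-8 terminal and the availability of these tools are assumptions):

    $ printf 'ä' | od -An -tx1                                   # the UTF-8 bytes of ä
     c3 a4
    $ printf 'ä' | iconv -f UTF-8 -t ISO8859-1 | od -An -tx1     # the same ä in ISO 8859-1
     e4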
Furthermore, there are encodings which support only a single language, or a handful of them. For example, ISO 8859-1 supports most Western European languages, and ISO 8859-2 supports the Central and Eastern European ones, as long as they are written with the Latin script. ASCII supports only English, while UTF-8 supports nearly all languages of the world.
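You can also check with iconv whether a character can be represented in a given encoding at all. For example, the Hungarian letter ő exists in ISO 8859-2 but not in ISO 8859-1 (again a sketch, assuming a UTF-8 terminal):

    $ printf 'ő' | iconv -f UTF-8 -t ISO8859-2 | od -An -tx1
     f5
    $ printf 'ő' | iconv -f UTF-8 -t ISO8859-1     # fails: ő has no code in ISO 8859-1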
Thus, there is a many-to-many relation between languages and possible encodings: a language can be encoded by multiple encodings, and most encodings can be used for multiple languages.
The compatibility between encodings is also far from uniform. For example, ISO 8859-2 is upwardly compatible with ASCII (plain ASCII text is already valid ISO 8859-2), but EBCDIC isn't. Convertibility is a different matter again: EBCDIC text can be converted losslessly to ASCII, but ISO 8859-2 text in general can't, because it contains characters that ASCII lacks.
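Both points can be illustrated with iconv (a sketch; EBCDIC-US is the glibc charset name I'm assuming here, other iconv implementations may spell it differently):

    $ printf 'hello' | od -An -tx1                                  # ASCII bytes
     68 65 6c 6c 6f
    $ printf 'hello' | iconv -f ASCII -t EBCDIC-US | od -An -tx1    # completely different bytes, but lossless
     88 85 93 93 96
    $ printf 'szőlő' | iconv -f UTF-8 -t ISO8859-2 | iconv -f ISO8859-2 -t ASCII    # fails: ő has no ASCII code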
So, there is a complex network of partially compatible standards. The encoding actually used belongs to the locale just as much as the language does, so it has to be handled similarly to the language; this is why it is handled with the same environment variables.
Although it would be possible, and in my opinion better, to handle it with a separate environment variable, so that the encoding would not be an extension of the language string, it is not done that way, mainly for historical and compatibility reasons.
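You can see how the encoding became part of the locale name in the output of "locale -a"; for example, if both the ISO 8859-2 and the UTF-8 variant of the Hungarian locale are generated on the system (an assumption, the exact list depends on what has been generated):

    $ locale -a | grep '^hu_HU'
    hu_HU.iso88592
    hu_HU.utf8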
But, at least in glibc, there is also support for additional environment variables. This is how the output of "locale" looks on one of my English Linuxes:
    $ locale
    LANG=en_US.UTF-8
    LANGUAGE=
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
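The values in double quotes mean that the given category is only inherited from LANG, not set explicitly; and LC_ALL, if set, would override all of the categories. For example (assuming the de_DE.UTF-8 locale is generated on the system):

    $ LC_ALL=de_DE.UTF-8 date +%B     # prints the current month name in German, e.g. "Januar"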
As you can see, there are separate LANG and LANGUAGE environment variables. Unfortunately, LANGUAGE exists mainly for standards-compatibility reasons, and the system doesn't really seem to follow it.
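For what it's worth, LANGUAGE is meant as a colon-separated priority list that GNU gettext uses for translated messages only, and it is ignored when the locale is plain C. A sketch, assuming the German translations of the tool are installed:

    $ LANG=en_US.UTF-8 LANGUAGE=de:en ls /nonexistent    # the error message comes out in German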
Extension:
About the history: in ancient times, nearly all systems were US English with ASCII or EBCDIC encoding. Multi-language or multi-encoding support was unheard of, and where it was developed, it used ad hoc solutions (for example, overwriting the character set bitmaps in the system firmware). I also developed Latin-2 character support for the C64 around 1988 (I made only some characters of Latin-2 available, not the whole code page; at the time, I didn't even know what code pages were). Code pages were a later invention, initially to standardize such extensions of ASCII. In the mainframe world, EBCDIC was used, which is inherently incompatible with ASCII, and a similar shift happened there as well (EBCDIC was originally designed to make punched cards easily readable by humans, while ASCII aims at easy data processing).
All of these encodings used 1 byte for 1 character. In Linux, multi-language support started in the libc4 era (early nineties). Unicode was a new and still unimplemented standard, and all software was developed with the assumption that 1 char = 1 byte. To make UTF viable, all of it had to be significantly modified, which was a source of major trouble in the following decade.
All of these language extensions used the upper half of the byte (ASCII specifies only 7 bits, so the place for ä, ö, or for the Cyrillic/Greek scripts was in the range 128-255). Thus, they were also incompatible with each other. At the time, the relation between languages and code pages was therefore more like one-to-one, instead of the current many-to-many.
Thus, supporting a language also clearly specified the code page to use, and there was no support for switching between code pages; even "code page" as terminology wasn't widely accepted. You can imagine how much trouble was caused when Win95 switched from the ibm850 of Win 3.1 to cp1251, while most tools didn't even know that code pages existed.
In Linux, the language was simply determined by two characters in the LANG environment variable, and no more. Language dialect support (like "pt_BR" for Brazilian Portuguese instead of simply "pt") was already an independent extension of that.
The need to support the same language with multiple encodings became really urgent with UTF-8. Although the problem existed before (see Serbian, which can be written with the Cyrillic and also with the Latin script), it caused trouble only for some small languages (as far as I know, Serbians simply used the Latin script for everything). Thus, multiple-encoding support was the next step of an ongoing, continuous development, and it followed the usual pattern: a further extension of the language string.