Locale settings are user preferences that relate to your culture.
Locale names
On all current unix variants that I know of (but not on a few antiques), locale names follow the same pattern:
- An ISO 639-1 lowercase two-letter language code, or an ISO 639-2 three-letter language code if the language has no two-letter code. For example, en for English, de for German, ja for Japanese, uk for Ukrainian, ber for Berber, …
- For many but not all languages, an underscore _ followed by an ISO 3166 uppercase two-letter country code. Thus: en_US for US English, en_GB for British English, fr_CA for Canadian (Québec) French, de_DE for German of Germany, de_AT for German of Austria, ja_JP for Japanese (of Japan), etc.
- Optionally, a dot . followed by the name of a character encoding such as UTF-8, ISO-8859-1, KOI8-U, GB2312, Big5, etc. With GNU libc at least (I don't know how widespread this is), case and punctuation are ignored in encoding names. For example, zh_CN.UTF-8 is Mandarin (simplified) Chinese encoded in UTF-8, while zh_CN is Mandarin Chinese encoded in GB2312, and zh_TW is Taiwanese (traditional) Chinese encoded in Big5.
- Optionally, an at sign @ followed by the name of a variant. The meaning of variants is locale-dependent. For example, many European countries have an @euro locale variant where the currency sign is € and where the encoding is one that includes this character (ISO 8859-15 or ISO 8859-16), as opposed to the unadorned variant with the older currency sign. Thus en_IE (English, Ireland) uses the latin1 (ISO 8859-1) encoding and £ as the currency symbol while en_IE@euro uses the latin9 (ISO 8859-15) encoding and € as the currency symbol.
In addition, there are two locale names that exist on all unix-like systems: C and POSIX. These names are synonymous and mean computerese, i.e. default settings that are appropriate for data that is parsed by a computer program.
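The naming pattern above can be sketched as a small parser. This is only an illustration: the names LocaleName, parse_locale_name and normalize_codeset are made up for this example, the regular expression is my own reading of the pattern rather than anything a libc uses, and the encoding normalization only mimics the "case and punctuation are ignored" behaviour described for GNU libc.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    """Hypothetical container for the four parts of a locale name."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# language[_TERRITORY][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"      # ISO 639 code, two or three lowercase letters
    r"(?:_(?P<territory>[A-Z]{2}))?"  # ISO 3166 country code
    r"(?:\.(?P<codeset>[^@]+))?"      # character encoding
    r"(?:@(?P<modifier>.+))?$"        # variant such as "euro"
)


def parse_locale_name(name: str) -> Optional[LocaleName]:
    """Split a locale name into its parts.

    Returns None for names that do not follow the pattern,
    which includes the special names "C" and "POSIX".
    """
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return LocaleName(**match.groupdict())


def normalize_codeset(codeset: str) -> str:
    """Compare encoding names while ignoring case and punctuation."""
    return re.sub(r"[^0-9a-z]", "", codeset.lower())


if __name__ == "__main__":
    for name in ("en", "de_AT", "zh_CN.UTF-8", "en_IE@euro", "C"):
        print(name, "->", parse_locale_name(name))
```

With this sketch, "UTF-8" and "utf8" normalize to the same string, so zh_CN.UTF-8 and zh_CN.utf8 would be treated as the same encoding.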
Locale settings
The following locale categories are defined by POSIX:
- LC_CTYPE: the character set used by terminal applications: classification data (which characters are letters, punctuation, spaces, invalid, etc.) and case conversion. Text utilities typically heed LC_CTYPE to determine character boundaries.
- LC_COLLATE: collation (i.e. sorting) order. This setting is of very limited use for several reasons:
  - Most languages have intricate rules that depend on what is being sorted (e.g. dictionary words and proper names might not use the same order) and cannot be expressed by LC_COLLATE.
  - Few of the tasks where proper sort order matters are performed by software that uses locale settings. For example, word processors store the language and encoding of a file in the file itself (otherwise the file wouldn't be processed correctly on a system with different locale settings) and don't care about the locale settings specified by the environment.
  - LC_COLLATE can have nasty side effects, in particular because it causes the sort order A < a < B < …, which makes “between A and Z” include the lowercase letters a through y. In particular, very common regular expressions like [A-Z] break some applications.
- LC_MESSAGES: the language of informational and error messages.
- LC_NUMERIC: number formatting: decimal and thousands separator. Many applications hard-code . as a decimal separator. This makes LC_NUMERIC not very useful and potentially dangerous:
  - Even if you set it, you'll still see the default format pretty often.
  - You're likely to get into a situation where one application produces locale-dependent output and another application expects . to be the decimal point, or , to be a field separator.
- LC_MONETARY: like LC_NUMERIC, but for amounts of local currency. Very few applications use this.
- LC_TIME: date and time formatting: weekday and month names, 12 or 24-hour clock, order of date parts, punctuation, etc.
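To make the LC_COLLATE side effect concrete, here is a small simulation of the two orderings. It does not load any real locale: dictionary_key is a made-up stand-in for a collation that sorts A < a < B < b, and ord stands in for the code point order used by the C locale. The helper names are hypothetical.

```python
def dictionary_key(ch: str):
    """Simulated locale collation: letters compare case-insensitively first,
    and uppercase sorts before lowercase on a tie (A < a < B < b ...)."""
    return (ch.lower(), ch.islower())


def in_range(ch: str, low: str, high: str, key) -> bool:
    """True if ch falls between low and high under the given ordering."""
    return key(low) <= key(ch) <= key(high)


letters = "AaBbYyZz"

# Code point order, as in the C locale: only uppercase letters are in A..Z.
c_matches = [ch for ch in letters if in_range(ch, "A", "Z", ord)]

# Simulated dictionary order: lowercase a through y sneak into A..Z.
dict_matches = [ch for ch in letters if in_range(ch, "A", "Z", dictionary_key)]

if __name__ == "__main__":
    print(sorted("bBaA", key=ord))             # ['A', 'B', 'a', 'b']
    print(sorted("bBaA", key=dictionary_key))  # ['A', 'a', 'B', 'b']
    print(c_matches)                           # ['A', 'B', 'Y', 'Z']
    print(dict_matches)                        # ['A', 'a', 'B', 'b', 'Y', 'y', 'Z']
```

Under the simulated ordering only lowercase z falls outside the range, which is the "a through y" surprise described above.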
GNU libc, which you'll find on non-embedded Linux, defines additional locale categories:
- LC_PAPER: the default paper size (defined by height and width).
- LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, LC_IDENTIFICATION: I don't know of any application that uses these.
Environment variables
Applications that use locale settings determine them from environment variables.
- The value of the LANG environment variable is used unless overridden by another setting. If LANG is not set, the default locale is C.
- Each LC_xxx category name can be used as an environment variable, which overrides LANG for that category.
- If LC_ALL is set, then all other values are ignored; this is primarily useful to run applications with LC_ALL=C when they need to produce the same output regardless of where they are run.
- In addition, GNU libc uses LANGUAGE to define fallbacks for LC_MESSAGES (e.g. LANGUAGE=fr_BE:fr_FR:en to prefer Belgian French, or if unavailable France French, or if unavailable English).
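The lookup order described above can be sketched as follows. The function names effective_locale and message_fallbacks are made up for this illustration, and the environment is passed in as a plain dict so that the example does not depend on the machine it runs on. I am assuming that an empty value counts as unset, and the LANGUAGE handling is deliberately simplified to splitting the list.

```python
from typing import Dict, List


def effective_locale(category: str, env: Dict[str, str]) -> str:
    """Return the locale name that applies to one category, e.g. "LC_TIME".

    Order of precedence: LC_ALL, then the category's own variable,
    then LANG, then the default "C". Empty values are treated as unset.
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"


def message_fallbacks(env: Dict[str, str]) -> List[str]:
    """Simplified view of the GNU LANGUAGE variable: a colon-separated
    preference list for messages, else the effective LC_MESSAGES locale."""
    language = env.get("LANGUAGE")
    if language:
        return [item for item in language.split(":") if item]
    return [effective_locale("LC_MESSAGES", env)]


if __name__ == "__main__":
    env = {"LANG": "fr_FR.UTF-8", "LC_COLLATE": "C"}
    print(effective_locale("LC_TIME", env))     # fr_FR.UTF-8
    print(effective_locale("LC_COLLATE", env))  # C
    env["LC_ALL"] = "C"
    print(effective_locale("LC_TIME", env))     # C
```

In a real program you could pass dict(os.environ) as env to see what applies to the current process.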
Installing locales
Locale data can be large, so some distributions don't ship them in a usable form and instead require an additional installation step.
- On Debian, to install locales, run dpkg-reconfigure locales and select from the list in the dialog box, or edit /etc/locale.gen and then run locale-gen.
- On Ubuntu, to install locales, run locale-gen with the names of the locales as arguments.
You can define your own locale.
Recommendation
The useful settings are:
- Set LC_CTYPE to the language and encoding that you encode your text files in. Ensure that your terminals use that encoding. For most languages, only the encoding matters. There are a few exceptions; for example, an uppercase i is I in most languages but İ in Turkish (tr_TR).
- Set LC_MESSAGES to the language that you want to see messages in.
- Set LC_PAPER to en_US if you want US Letter to be the default paper size and just about anything else (e.g. en_GB) if you want A4.
- Optionally, set LC_TIME to your favorite time format.
As explained above, avoid setting LC_COLLATE and LC_NUMERIC. If you use LANG, explicitly override these two categories by setting them to C.
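As a rough illustration of why pinning these two categories to C is the safe choice, the sketch below contrasts a number formatted the way a comma-decimal locale would print it with the same number formatted under the C locale. The german_style helper is a hand-rolled stand-in (so the example does not need de_DE installed), and pin_to_c and parses_as_float are made-up names. The in-process setlocale calls are only a per-program analogue of setting LC_NUMERIC=C and LC_COLLATE=C in the environment.

```python
import locale


def german_style(value: float) -> str:
    """Stand-in for comma-decimal locale output: 1234.5 -> "1.234,50"."""
    text = f"{value:,.2f}"  # "1,234.50"
    return text.translate(str.maketrans(",.", ".,"))  # swap the separators


def parses_as_float(text: str) -> bool:
    """Mimic a consumer that hard-codes "." as the decimal separator."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def pin_to_c() -> None:
    """Force the two risky categories to the C locale for this process."""
    locale.setlocale(locale.LC_NUMERIC, "C")
    locale.setlocale(locale.LC_COLLATE, "C")


if __name__ == "__main__":
    localized = german_style(1234.5)
    print(localized, parses_as_float(localized))  # 1.234,50 False
    print(localized.split(","))                   # the comma also splits a CSV field

    pin_to_c()
    plain = locale.format_string("%.2f", 1234.5)
    print(plain, parses_as_float(plain))          # 1234.50 True
    print(locale.strcoll("B", "a") < 0)           # True: code point order
```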