10
$\begingroup$

By "cases" I mean uppercase, lowercase, and titlecase.

It seems many languages assume that there is a one-to-one correspondence between uppercase and lowercase letters whenever the script those letters belong to has a notion of "cases".

However, this assumption is simply false. Many scenarios demonstrate its failure. To list a few:

1. The Turkish I Problem:
Most languages that adopted the Latin script have I and i as an uppercase-lowercase pair. Turkish, however, does not have this pair; instead it pairs I with ı, and İ with i. Some other languages influenced by Turkish orthography, such as Azerbaijani, do the same.

2. The Greek lowercase sigma:
The default case pair for the Greek letter sigma is uppercase Σ and lowercase σ. At the end of a word, however, lowercase σ is not used; ς is used instead.

3. The Greek subscript iota:
The ancient Greek letter ᾼ is not an uppercase letter; it is titlecase in its own right. As such, when ᾳ is converted to uppercase, the result must be ΑΙ, which consists of two codepoints.

4. Letters without precomposed uppercase counterpart:
For example, the Latin ligature ﬁ doesn't have a precomposed uppercase counterpart. Its uppercase counterpart is FI, consisting of two codepoints.

5. The Georgian script:
In the modern orthography of Georgian, the "uppercase" letters are simply unused, and only the "lowercase" letters are used. In fact, calling those "uppercase" and "lowercase" was simply a misnomer.

These have a few consequences. First, since there is no one-to-one correspondence of case pairs as single codepoints, case-converting functions cannot just map characters one at a time; they need a more sophisticated mechanism that operates on strings. Second, the Turkish I problem suggests that case-converting functions must also be able to take a locale into account.
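As a rough illustration of the first consequence, the following JavaScript sketch shows that case conversion can change a string's length, that two lowercase letters can share one uppercase letter, and that a round trip need not return the original string. It assumes an engine whose built-in `toUpperCase` and `toLowerCase` apply the full Unicode case mappings; the helper name `caseFacts` is made up for this example.

```javascript
// Hypothetical helper: collects a few facts about how one string behaves
// under the built-in, locale-independent case conversions.
function caseFacts(s) {
  const upper = s.toUpperCase();
  return {
    upper,
    lengthChanged: upper.length !== s.length, // measured in UTF-16 code units
    roundTrips: upper.toLowerCase() === s,
  };
}

const ligature = caseFacts("\uFB01");   // U+FB01 LATIN SMALL LIGATURE FI
const finalSigma = caseFacts("\u03C2"); // U+03C2 GREEK SMALL LETTER FINAL SIGMA
const sigma = caseFacts("\u03C3");      // U+03C3 GREEK SMALL LETTER SIGMA

console.log(ligature);                          // upper is "FI"; no round trip
console.log(finalSigma.upper === sigma.upper);  // true: both uppercase to U+03A3
console.log(finalSigma.roundTrips, sigma.roundTrips); // false true
```

A per-character lookup table cannot express any of these three behaviours, which is why string-level conversion functions are needed.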

Are there any other problems regarding cases? Is there a better way to solve such problems?

$\endgroup$
7
  • $\begingroup$ some langs have "simple" case conversion that works on characters and "full" case conversion that works on strings and can change length $\endgroup$ Commented May 31 at 22:08
  • 2
    $\begingroup$ Some languages have characters that have no direct equivalent in the other case. A well known one is the German sharp S ("ß") which is always lower case, and has no upper case equivalent (though I guess there's been discussion of adding one, simply because lack of it fits so poorly with the rest of the language). You can render "ß" as "ss", and then convert that to upper case to get something vaguely along the right lines but, only vaguely at best. $\endgroup$ Commented Jun 1 at 4:11
  • $\begingroup$ @JerryCoffin An uppercase ß exists for a while now: ẞ $\endgroup$ Commented Jun 1 at 7:50
  • 3
    $\begingroup$ @MisterMiyagi: True, but it's seldom used. Most German speakers still uppercase "ß" as "SS". $\endgroup$ Commented Jun 1 at 11:53
  • 7
    $\begingroup$ Could it be you are mixing "language" with "library" (like C's toupper())? Or maybe give a concrete example where it is a problem. $\endgroup$ Commented Jun 1 at 22:17

5 Answers 5

31
$\begingroup$

As something of a frame challenge, you shouldn't (in general) do any of this in the language, precisely because of all these complexities.

Instead, enable external library bindings to something like International Components for Unicode, which addresses the full range of Unicode-supported functionality, including locale-aware case changes, collation, normalisation, and so on. Allow the user to do the right thing for their actual use case deliberately, rather than providing the attractive nuisance of a language-level feature that doesn't really work for all the reasons you gave and more.

Dealing with Unicode encoding questions in the language is nearly inescapable (if you have anything approaching a text type and interface with the outside), but the natural-language side of things is too complex, too varied, and too dynamic to bake into a programming language unless that is the core thing the language is doing. If it is the core thing you're doing, you'll already know about that and you'll know which axes of variation you're dealing with.

Some programming languages use character case as part of the syntax rather than as data (e.g. for visibility or namespacing). This is a deliberate trade-off, and it is usually a yes-or-no test on each token rather than something that is dynamically transformed between cases. Such languages may also want to rule out precomposed ligatures like ĳ and ﬁ entirely. If identifiers are expected to be "English" in practice, this is tenable, but it is unlikely to be a suitable restriction for strings in a general-purpose language today.
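A yes-or-no test of this kind can be sketched with Unicode property escapes in a JavaScript regular expression. The function names and the specific rules below are assumptions chosen for illustration, not the rule of any particular language: a first character in general category Lu means "exported", and identifiers that change under NFKC normalisation are rejected.

```javascript
// Hypothetical rule: an identifier is "exported" if its first character
// has the Unicode general category Lu (uppercase letter).
function isExported(identifier) {
  return /^\p{Lu}/u.test(identifier);
}

// Hypothetical rule: flag identifiers containing compatibility characters
// such as precomposed ligatures, detected as "changes under NFKC".
function hasCompatibilityChars(identifier) {
  return identifier.normalize("NFKC") !== identifier;
}

console.log(isExported("Widget"));   // true
console.log(isExported("widget"));   // false
// U+01C5 is a titlecase letter (category Lt), so it does not count as Lu.
console.log(isExported("\u01C5en")); // false

console.log(hasCompatibilityChars("\uFB01le")); // true: U+FB01 is the fi ligature
console.log(hasCompatibilityChars("file"));     // false
```

Note that the test only classifies a token; it never converts between cases, so the length-changing and locale-dependent mappings do not come into play.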

$\endgroup$
4
  • $\begingroup$ sigh. Vocabulary is part of language, and the standard library is definitely part of a language. Nitpicking about language vs "library" is not nearly as useful as actually answering the question about how the standard library and its text functions should work - at least if you want people to use your language, which means they should want to use your standard library. $\endgroup$ Commented Jun 3 at 13:09
  • 2
    $\begingroup$ This is an argument that these features should not be part of the standard library at all, but left to external libraries used by the programmer for the application at hand; that’s precisely an answer about how the standard library and its text functions should work! $\endgroup$ Commented Jun 3 at 18:03
  • 3
    $\begingroup$ @MichaelHomer: IME, “don't put this in the standard library” = “have a bunch of subtly-incompatible third-party libraries that will eventually need to be awkwardly combined in the same program”. $\endgroup$ Commented Jun 3 at 20:54
  • 1
    $\begingroup$ @dan04: Interestingly, as a Ruby user, my experience has been "put this in the standard library" = "stifle competition and innovation by making better alternatives less discoverable". The Ruby community has been working hard the last 15 years to remove functionality from the stdlib and convert them into packages which can then compete with other packages on a level playing field. Engineering always has tradeoffs. Who would have thought :-D $\endgroup$ Commented Jun 4 at 8:36
13
$\begingroup$

Are there any other problems regarding cases?

In addition to the many fine answers you've gotten so far, let me give you a meta-problem in this area that we ran into a couple decades ago on the C# team. Unicode is a moving target.

One day I was tasked with an extremely confusing bug involving the rules for upper-casing a string; I do not recall the details, but I do recall I had a devil of a time reproducing it. After a lot of effort and grief I discovered the horrifying truth.

The Unicode processing library that the C# compiler had been statically linked against (back when the C# compiler was written in C++) used a different version of the Unicode standard than the .NET runtime standard library used, which was in turn different than the Unicode standard compiled into the version of Windows the user who reported the bug was using. Three different Unicode standards, one machine.

What is the poor compiler writer to do when faced with such a nightmare? If you statically link to a particular standard library then your compiler gets out of date when the user upgrades their operating system, and if you dynamically link then programs can compile differently when the user upgrades their operating system.

I believe we eventually got into a state where the compiler, the runtime library and Windows all agreed, and just hoped that upgrading wouldn't break programs in the future.
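One small mitigation is to make the versions visible, so that a mismatch can at least be diagnosed. The sketch below assumes a Node.js runtime, where `process.versions` reports `unicode` and `icu` entries when Node is built with ICU support. Other runtimes may expose nothing comparable, in which case the hypothetical helper `unicodeVersions` returns `null` fields.

```javascript
// Hypothetical helper: report which Unicode and ICU versions this runtime
// claims to implement. Assumes Node.js; fields are null where unavailable.
function unicodeVersions() {
  const versions =
    typeof process !== "undefined" && process.versions ? process.versions : {};
  return {
    unicode: versions.unicode ?? null,
    icu: versions.icu ?? null,
  };
}

console.log(unicodeVersions());
```

Logging this alongside a bug report would not fix the kind of three-way disagreement described above, but it would shorten the time needed to discover it.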

$\endgroup$
10
$\begingroup$

To complement and illustrate the other answers:

In JavaScript:

    // The same letter gets a different lowercase version depending on locale:
    'I'.toLocaleLowerCase('fr'); // -> i
    'I'.toLocaleLowerCase('tr'); // -> ı

    // The same letter gets a different lowercase version depending on context:
    'Σ'.toLocaleLowerCase(); // -> σ
    'ΔΕΜΩΣ'.toLocaleLowerCase(); // -> δεμως

    // A letter gets an uppercase version with two letters
    'ᾳ'.toLocaleUpperCase(); // -> ΑΙ
    'ﬁ'.toLocaleUpperCase(); // -> FI

    // No (visible) change in case (but they're actually different code points)
    'დემონსტრაცია'.toLocaleUpperCase(); // -> ᲓᲔᲛᲝᲜᲡᲢᲠᲐᲪᲘᲐ

As you can see:

  • All of your examples are handled as you expect them to be
  • It does not use single characters (or symbols, code points...) but strings as input and output
  • It can be locale-specific

However, none of these implementation details are part of the language itself; they come from the ICU library. Don't even think about trying to reinvent the wheel, especially here: it's an extremely complex topic. Just provide appropriate bindings to ICU.

$\endgroup$
2
  • 1
    $\begingroup$ FWIW, the last example in Georgian does render differently on my computer. The uppercase version has the top and bottom of all letters aligned. $\endgroup$ Commented Jun 3 at 10:20
  • $\begingroup$ @zdimension Same on mine $\endgroup$ Commented Jun 4 at 0:52
5
$\begingroup$

It shouldn't be assumed that human languages in general work like English, or even mostly so.

The existence of "uppercase" and "lowercase" means that English is written in a "bicameral script", which most scripts aren't. The rules for the use of each case are complicated (and sometimes dependent on meaning or context), different for each language (amongst those that have cases), and not always fully standard.

English is one of only some languages with bicameral scripts that have a consistent one-to-one mapping between upper and lower cases; even a language as closely related as German lacks this quality.

The casing of already-written text is only rarely "modified" in everyday life, so the operation of doing so is poorly defined. Organisations that deal with a lot of text, like publishers, tend to have their own complicated style guides.

Any modification of the casing of arbitrary mixed-case text, which doesn't obliterate the casing or commandeer it for a non-standard purpose according to special rules, is not typically a job well-suited for computerisation, and instead that kind of textual modification would normally be done by a person proficient in the relevant language.

Most computer applications are written in a culturally-specific way by programmers familiar with that culture, not just national cultures but occupational or even firm-specific cultures. Programming languages themselves are culturally-specific.

The UPPER and LOWER algorithms that some programming languages have are generally English-centric. They are intended to operate on codewords, to obliterate any casing distinction in an application that enforces non-distinction, or to commandeer and impose a casing systematically in an application-defined way (for example, for presentation, a surname may be capitalised and forenames presented in title case).
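For the codeword use case, a conversion that deliberately touches only ASCII letters is often all that is intended. A minimal sketch, with the made-up name `asciiUpper`:

```javascript
// Uppercase only the ASCII letters a-z; every other character, including
// non-ASCII letters, is left untouched. Intended for codewords, not prose.
function asciiUpper(s) {
  return s.replace(/[a-z]/g, (c) => String.fromCharCode(c.charCodeAt(0) - 32));
}

console.log(asciiUpper("order_id-42")); // ORDER_ID-42
console.log(asciiUpper("stra\u00DFe")); // STRAßE (the ß is not expanded to SS)
```

Being explicit that the function is ASCII-only avoids promising behaviour for natural-language text that it cannot deliver.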

In programming language design, the best advice would be to choose your own culture that you know best, and write your algorithms for that culture. Don't foolishly think that you're going to single-handedly reduce all cultures in the world to a common conceptualisation - your language will just end up with vast complexities both in the design by you and in the usage by others.

The work of the Unicode consortium has run for decades, and the primary goal is to ensure written work can be consistently represented, stored, transferred, and displayed by computer. There is not even any attempt to tackle algorithms that actually manipulate text or define a standard set of those algorithms.

$\endgroup$
4
  • $\begingroup$ "with even languages as closely related as German not having this quality." What would be an example for German where the quality is not satisfied? A-ZÄÖÜẞ maps to a-zäöüß. $\endgroup$ Commented Jun 2 at 7:35
  • 2
    $\begingroup$ @infinitezero, well, reading this on my phone (which is about 12 months old), the last character of your capitalised set appears as a placeholder square! The capitalised version of ß in German is an innovation from very recent years, probably to facilitate certain mechanical transformations of casing. The problem is that there will still be vast amounts of older text in existence (not to mention users of the German writing system whose habits are already formed) that represents the uppercase ß as SS, confounding the correctness of any trivial lowercasing algorithm. (1/2) $\endgroup$ Commented Jun 2 at 9:25
  • 9
    $\begingroup$ The issue here isn't that natural languages cannot be redesigned in principle, it is that they aren't going to be in practice. The German national library and centuries of archives are not going to be rewritten. About 100m living people who write German are not going to be reschooled. We would want the computer to correctly handle German as we know it, and not some new German-like language invented by a programmer who spent about 5 minutes thinking about the issue and decided German language needed to change in order to make it easier for him to program a case-changing algorithm. (2/2) $\endgroup$ Commented Jun 2 at 9:26
  • 1
    $\begingroup$ @Steve: Well, there have been cases of languages radically changing orthography. Such as Turkish's transition from Arabic to Latin script. Or even Russian dropping 4 letters from its alphabet in 1918. But those were products of revolutionary zeal (and the argument that it would improve literacy), not because the reformers wanted to simplify machine collation of text. $\endgroup$ Commented Jun 3 at 21:11
3
$\begingroup$

6. The German Sharp S

The letter ß was traditionally lowercase-only because it never comes at the beginning of a word in German. But that raises the question of how to represent it in all-caps text. The official Unicode case mapping is SS for uppercase and Ss for titlecase. But this can create ambiguity: Does “MASSE” mean “Masse” or “Maße”? So eventually, Unicode introduced a capital ẞ (U+1E9E) for use in all-caps text, which lowercases to ß.
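The asymmetry can be observed directly in JavaScript. This sketch uses only the built-in, locale-independent conversions; the variable names are illustrative.

```javascript
const sharpS = "\u00DF";        // ß LATIN SMALL LETTER SHARP S
const capitalSharpS = "\u1E9E"; // ẞ LATIN CAPITAL LETTER SHARP S

console.log(sharpS.toUpperCase());                   // SS
console.log(capitalSharpS.toLowerCase() === sharpS); // true

// Both words collapse to the same all-caps string, so the distinction
// cannot be recovered by lowercasing afterwards.
const masse = "Masse".toUpperCase();
const masze = ("Ma" + sharpS + "e").toUpperCase();
console.log(masse === masze); // true: both are MASSE
```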

7. Lithuanian accent marks

In most Latin-script languages with diacritics, the letters i and j lose their dots when an accent mark is added. For example, in Spanish, i with a stress accent is í.

In Lithuanian, however, the case pairs Ì/i̇̀, Í/i̇́, and Ĩ/i̇̃ exist, with the lowercase versions including the character U+0307 COMBINING DOT ABOVE. This needs to be handled as a language-specific special case, as with Turkish.
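A short sketch of the difference, assuming a JavaScript engine with ICU-backed locale support (as in current Node.js and browser builds). Code points are printed rather than glyphs, because the two results may render alike.

```javascript
const capitalIGrave = "\u00CC"; // Ì LATIN CAPITAL LETTER I WITH GRAVE

// Locale-independent lowercasing gives the single precomposed letter U+00EC.
const plain = capitalIGrave.toLowerCase();

// With the Lithuanian locale, the dot is expected to be kept explicitly:
// i, U+0307 COMBINING DOT ABOVE, U+0300 COMBINING GRAVE ACCENT.
const lithuanian = capitalIGrave.toLocaleLowerCase("lt");

// Helper for display: space-separated hexadecimal code points.
const codePoints = (s) =>
  [...s].map((c) => c.codePointAt(0).toString(16).padStart(4, "0")).join(" ");

console.log(codePoints(plain));
console.log(codePoints(lithuanian));
```

On an engine without Lithuanian tailoring, the second line would presumably match the first, which is itself a useful check of what the runtime supports.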

8. Different length UTF-8 sequences

Even if the uppercase and lowercase versions of a string happen to contain the same number of Unicode code points, that doesn't mean that the in-memory representation will be the same length. For example, within the Latin script:

  • ı (C4 B1) vs. I (49)
  • ſ (C5 BF) vs. S (53)
  • ȿ (C8 BF) vs. Ȿ (E2 B1 BE)
  • ɀ (C9 80) vs. Ɀ (E2 B1 BF)
  • ɐ (C9 90) vs. Ɐ (E2 B1 AF)
  • ɑ (C9 91) vs. Ɑ (E2 B1 AD)
  • ɒ (C9 92) vs. Ɒ (E2 B1 B0)
  • ɜ (C9 9C) vs. Ɜ (EA 9E AB)
  • ɡ (C9 A1) vs. Ɡ (EA 9E AC)
  • ɥ (C9 A5) vs. Ɥ (EA 9E 8D)
  • ɦ (C9 A6) vs. Ɦ (EA 9E AA)
  • ɪ (C9 AA) vs. Ɪ (EA 9E AE)
  • ɫ (C9 AB) vs. Ɫ (E2 B1 A2)
  • ɬ (C9 AC) vs. Ɬ (EA 9E AD)
  • ɱ (C9 B1) vs. Ɱ (E2 B1 AE)
  • ɽ (C9 BD) vs. Ɽ (E2 B1 A4)
  • ʂ (CA 82) vs. Ʂ (EA 9F 85)
  • ʇ (CA 87) vs. Ʇ (EA 9E B1)
  • ʝ (CA 9D) vs. Ʝ (EA 9E B2)
  • ʞ (CA 9E) vs. Ʞ (EA 9E B0)
  • ⱥ (E2 B1 A5) vs. Ⱥ (C8 BA)
  • ⱦ (E2 B1 A6) vs. Ⱦ (C8 BE)

So, even if you ignore “special” case pairs that aren't a one-to-one mapping, you still can't assume that case-folding a string will preserve its byte length.
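The byte counts can be checked with the standard `TextEncoder` API. The helper name `utf8Length` is made up for this sketch, and the second example assumes the engine's Unicode data includes the U+023F/U+2C7E case pair.

```javascript
// UTF-8 byte length of a string, using the standard TextEncoder API.
function utf8Length(s) {
  return new TextEncoder().encode(s).length;
}

const dotlessI = "\u0131"; // ı, uppercases to ASCII I
const swashS = "\u023F";   // ȿ, uppercases to U+2C7E

// One code point in, one code point out, but the byte length shrinks...
console.log(utf8Length(dotlessI), utf8Length(dotlessI.toUpperCase())); // 2 1
// ...or grows.
console.log(utf8Length(swashS), utf8Length(swashS.toUpperCase()));     // 2 3
```

An in-place conversion over a fixed-size UTF-8 buffer is therefore unsafe even for these one-to-one pairs.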

$\endgroup$
