For what characters c does ToCharacterCode[c] not return the same value as c's Unicode code point? How does one convert c to its Unicode code point in Mathematica in general?
1 Answer
You can see here.
For a character in the range U+0000 to U+D7FF or U+E000 to U+FFFF, ToCharacterCode[c] just returns c's Unicode code point.
For a character in the range U+10000 to U+10FFFF, ToCharacterCode[c] returns two numbers (a UTF-16 surrogate pair), and Mathematica treats the character as two characters.
For example:
In[1]:= ToCharacterCode /@ {"$", "€", "𐐷", "𤭢"}
Out[1]= {{36}, {8364}, {55297, 56375}, {55378, 57186}}
In[2]:= StringLength@"𐐷"
Out[2]= 2
In fact, the Unicode code point of "𐐷" is U+10437, which is 66615 in decimal, and {55297, 56375} is just IntegerDigits[66615 - 65536, 1024] + {55296, 56320}.
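Going the other way (character to code point) means undoing that arithmetic. Here is a minimal sketch, assuming ToCharacterCode returns surrogate pairs as in the output above. The names mergeSurrogates and toCodePoints are hypothetical helpers, not built-in functions.

```mathematica
(* Hypothetical helper, not a built-in: merge each UTF-16 surrogate pair
   (high in 55296..56319, low in 56320..57343) into a single code point. *)
mergeSurrogates[codes_List] :=
  codes //. {a___,
      hi_Integer /; 55296 <= hi <= 56319,
      lo_Integer /; 56320 <= lo <= 57343,
      b___} :>
    {a, 65536 + 1024 (hi - 55296) + (lo - 56320), b}

(* Hypothetical wrapper: string -> list of Unicode code points. *)
toCodePoints[s_String] := mergeSurrogates[ToCharacterCode[s]]

mergeSurrogates[{36, 8364, 55297, 56375, 55378, 57186}]
(* {36, 8364, 66615, 150370} *)
```

If ToCharacterCode already returns a full code point for such a character, no surrogate values appear in the list and the rule leaves it unchanged. The repeated replacement rescans the list after each merge, so this sketch is only suited to short strings.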
The following function converts a Unicode code point to the corresponding list of Mathematica character codes.
(* the third argument of IntegerDigits pads to two digits, needed for U+10000 to U+103FF *)
If[# < 65536, {#}, IntegerDigits[# - 65536, 1024, 2] + {55296, 56320}] &
- What about the characters between U+D7FF and U+E000? – user13253, Mar 19, 2015 at 2:07
- @qazwsx They are not assigned to characters. – alephalpha, Mar 19, 2015 at 2:21
- 55296 is 0xD800. What's 56320? – user13253, Mar 19, 2015 at 3:57
- @qazwsx 0xDC00. – alephalpha, Mar 19, 2015 at 5:50
- Uh... this question asks for char -> Unicode code point, not vice versa. – user202729, Jul 4, 2018 at 16:27