
I'm writing a JSON parser in Xojo. It works, apart from the fact that I can't figure out how to encode and decode Unicode characters that are not in the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters anything greater than \uFFFF.

The specs say:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

What I don't understand is the algorithm for going from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".

For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair and what is the working to derive it?

Could somebody please at least point me in the correct direction?

  • "what is the algorithm to go from U+1D11E to \uD834\uDD1E" - UTF-16, just as the JSON spec says. You do know how UTF-16 works in general, don't you? If not, read its well-documented algorithm. In this case, UTF-16 encodes codepoint U+1D11E as two code units 0xD834 and 0xDD1E, which JSON then encodes as-is in \uXXXX string format. The UTF-16 surrogate pair for codepoint U+1F600 is 0xD83D 0xDE00, so \uD83D\uDE00 in JSON – Remy Lebeau Commented Oct 18, 2018 at 20:04
  • I read the linked article on Wikipedia - it was very helpful. I will answer my own question with an example for future reference for others. Thank you for checking my maths. Commented Oct 18, 2018 at 22:54
  • You have no need to actually make this conversion. stackoverflow.com/q/11641983/209139 Commented Dec 13, 2019 at 11:33

1 Answer


Taken from the Wikipedia article on UTF-16 linked by Remy Lebeau in the comments above:

To encode U+10437 (𐐷) to UTF-16:

  • Subtract 0x10000 from the code point, leaving 0x0437.
  • For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
  • For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
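Here is a minimal sketch of the encoding step in Xojo, assuming the input code point is already known to be above U+FFFF. The function name and the zero-padding via Right/Hex are my own choices for illustration, not part of any library:

    // Turn a code point above U+FFFF into a JSON "\uXXXX\uXXXX" surrogate-pair escape.
    // Assumes codePoint is in the range &h10000 to &h10FFFF.
    Function EncodeSurrogateEscape(codePoint As UInt32) As String
      Dim v As UInt32 = codePoint - &h10000         // 20-bit offset into the supplementary planes
      Dim high As UInt32 = (v \ &h400) + &hD800     // top 10 bits -> high surrogate
      Dim low As UInt32 = (v Mod &h400) + &hDC00    // bottom 10 bits -> low surrogate
      Return "\u" + Right("0000" + Hex(high), 4) _
           + "\u" + Right("0000" + Hex(low), 4)     // e.g. "\uD801\uDC37" for &h10437
    End Function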

To decode U+10437 (𐐷) from UTF-16:

  • Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
  • Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
  • Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
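The reverse direction is just that arithmetic undone. A sketch, again with a made-up function name:

    // Combine a UTF-16 surrogate pair back into a single code point.
    // Assumes high is in &hD800..&hDBFF and low is in &hDC00..&hDFFF.
    Function DecodeSurrogatePair(high As UInt32, low As UInt32) As UInt32
      Return ((high - &hD800) * &h400) + (low - &hDC00) + &h10000
    End Function

Working through the smiley face from the question with the same steps: 0x1F600 - 0x10000 = 0xF600; 0xF600 \ 0x400 = 0x3D, so the high surrogate is 0xD800 + 0x3D = 0xD83D; 0xF600 Mod 0x400 = 0x200, so the low surrogate is 0xDC00 + 0x200 = 0xDE00. That gives "\uD83D\uDE00", matching the pair given in the comments.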
