I'm writing a JSON parser in Xojo. It works, except that I can't figure out how to encode and decode Unicode strings that are outside the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters a code point greater than U+FFFF.
The specs say:
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
What I don't understand is the algorithm for going from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".
For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair and what is the working to derive it?
Could somebody please at least point me in the correct direction?
"U+1D11E to \uD834\uDD1E" - that is UTF-16, just as the JSON spec says. You do know how UTF-16 works in general, don't you? If not, read up on its well-documented algorithm. In this case, UTF-16 encodes code point U+1D11E as two code units, 0xD834 and 0xDD1E, which JSON then encodes as-is in \uXXXX string format. The UTF-16 surrogate pair for code point U+1F600 is 0xD83D 0xDE00, so it is \uD83D\uDE00 in JSON.
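For reference, here is a minimal sketch of the UTF-16 surrogate arithmetic, written in Python rather than Xojo, so treat it as something to port and not as a drop-in solution. The function names (`to_surrogate_pair`, `from_surrogate_pair`, `json_escape`) are made up for illustration. The idea is: subtract 0x10000 from the code point, which leaves a 20-bit value; the top 10 bits are added to 0xD800 to give the high surrogate, and the bottom 10 bits are added to 0xDC00 to give the low surrogate. Decoding reverses those steps.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def json_escape(code_point: int) -> str:
    """Return the JSON \\uXXXX escape (one or two units) for a code point."""
    if code_point < 0x10000:
        return "\\u%04X" % code_point
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)


print(json_escape(0x1D11E))  # \uD834\uDD1E
print(json_escape(0x1F600))  # \uD83D\uDE00
```

Worked by hand for U+1F600: 0x1F600 - 0x10000 = 0xF600; 0xF600 >> 10 = 0x3D, so the high surrogate is 0xD800 + 0x3D = 0xD83D; 0xF600 & 0x3FF = 0x200, so the low surrogate is 0xDC00 + 0x200 = 0xDE00. When parsing, a \uXXXX value in the range 0xD800 to 0xDBFF should be followed by another \uXXXX in the range 0xDC00 to 0xDFFF, and the two can be combined as in `from_surrogate_pair`.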