Handling strings with high Unicode codepoints (above U+FFFF)

Question

In Kotlin, how can I iterate over a string that contains Unicode characters above U+FFFF?

Example code:

    val s = "Hëllø! € 😀"
    for (c in s) {
        println("$c ${c.code}")
    }

Actual output:

    H 72
    ë 235
    l 108
    l 108
    ø 248
    ! 33
      32
    € 8364
      32
    � 55357
    � 56832

Desired output:

    H 72
    ë 235
    l 108
    l 108
    ø 248
    ! 33
      32
    € 8364
      32
    😀 128512

The Iterator implementation shown in the duplicate target should already do what you want. My answer there goes one step further and combines all the combining marks into a single element too (which you might want to do as well).
– Sweeper, Commented Nov 25, 2024 at 16:26
@Sweeper your duplicate could be the answer the OP needs, but it's not the question he asked. He is not asking about combining characters. His question can be answered simply: replace for (c in s) with for (c in s.codePoints()).
– k314159, Commented Nov 25, 2024 at 16:59
@k314159 I'll wait for OP to respond. If OP doesn't want combining characters and ZWJ sequences to be combined, I'm happy to reopen.
– Sweeper, Commented Nov 25, 2024 at 17:23
@k314159 You offer half the solution. For a full solution I had to change the body of the loop to var i = IntArray(1); i[0] = c; println("${String(i, 0, 1)} $c"). This compiles with kotlinc, but with Kotlin/Native I get errors like unresolved reference 'codePoints'. Both use Kotlin version 2.0.21.
– Peter Kleiweg, Commented Nov 26, 2024 at 19:08
The "This question already has an answer here" notice refers to a completely different issue.
– Peter Kleiweg, Commented Nov 26, 2024 at 19:11

k314159 · Accepted Answer · 2024-12-02 10:52:00Z

In Kotlin/JVM, strings are encoded in UTF-16. (To be precise, they may be stored internally as Latin-1, but they still behave externally as if they were encoded in UTF-16.) This means that they're made up of 16-bit code units (Char values), and a code point above U+FFFF takes two of them, a surrogate pair. That is why iterating over the Chars prints two numbers for 😀. To get the actual Unicode code points, including those above U+FFFF, you can use Java's codePoints() method:

    val s = "Hëllø! € 😀"
    for (cp in s.codePoints()) {
        println("${buildString { appendCodePoint(cp) }} $cp")
    }

Output:

    H 72
    ë 235
    l 108
    l 108
    ø 248
    ! 33
      32
    € 8364
      32
    😀 128512
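
To see how the two values from the question's output (55357 and 56832) relate to 128512, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The function name combineSurrogates is made up for illustration, and it does no validation of its inputs.

```kotlin
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Assumes `high` is in 0xD800..0xDBFF and `low` is in 0xDC00..0xDFFF (not checked).
fun combineSurrogates(high: Int, low: Int): Int =
    0x10000 + ((high - 0xD800) shl 10) + (low - 0xDC00)

fun main() {
    // 55357 = 0xD83D and 56832 = 0xDE00 are the two values printed for 😀 in the question.
    // 0x10000 + (0x3D shl 10) + 0x200 = 65536 + 62464 + 512
    println(combineSurrogates(55357, 56832)) // 128512 (U+1F600)
}
```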

However, be aware of the presence of combining characters, where multiple Unicode code points are used to make a single grapheme. If you want to support combining characters, then my answer will not help: you will need to look at Sweeper's answer in this question instead.
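
To illustrate the difference between code points and graphemes, here is a rough JVM-only sketch that uses java.text.BreakIterator. This is not the approach from Sweeper's answer, the helper name graphemes is made up, and how well it groups emoji sequences depends on the JDK version, so treat it as a starting point only.

```kotlin
import java.text.BreakIterator

// Hypothetical helper: split a string into the "characters" that
// java.text.BreakIterator reports (base character plus combining marks).
fun graphemes(s: String): List<String> {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(s)
    val result = mutableListOf<String>()
    var start = boundaries.first()
    var end = boundaries.next()
    while (end != BreakIterator.DONE) {
        result += s.substring(start, end)
        start = end
        end = boundaries.next()
    }
    return result
}

fun main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x":
    // three code points, but two user-perceived characters.
    val s = "e\u0301x"
    println(s.codePoints().count()) // 3
    println(graphemes(s).size)      // 2
}
```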

Unfortunately, on other platforms such as Kotlin/Native, where codePoints() is not available, Kotlin currently doesn't make it easy to handle Unicode. See this discussion for a list of currently open issues.
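
Where codePoints() is not available, one option is to decode the surrogate pairs by hand. The sketch below assumes that Char.isHighSurrogate(), Char.isLowSurrogate() and Char.code are available on the target platform (to my knowledge they are part of the common standard library). The names forEachCodePoint and codePointToString are made up for illustration, and unpaired surrogates are passed through unchanged.

```kotlin
// Hypothetical helper (not part of the Kotlin standard library): calls `action`
// once per Unicode code point, decoding UTF-16 surrogate pairs by hand.
fun String.forEachCodePoint(action: (Int) -> Unit) {
    var i = 0
    while (i < length) {
        val c = this[i]
        if (c.isHighSurrogate() && i + 1 < length && this[i + 1].isLowSurrogate()) {
            action(0x10000 + ((c.code - 0xD800) shl 10) + (this[i + 1].code - 0xDC00))
            i += 2
        } else {
            // BMP character, or an unpaired surrogate passed through unchanged.
            action(c.code)
            i += 1
        }
    }
}

// Hypothetical helper: turn a code point back into a String.
fun codePointToString(cp: Int): String =
    if (cp < 0x10000) {
        cp.toChar().toString()
    } else {
        val offset = cp - 0x10000
        val high = (0xD800 + (offset shr 10)).toChar()
        val low = (0xDC00 + (offset and 0x3FF)).toChar()
        "$high$low"
    }

fun main() {
    "Hëllø! € 😀".forEachCodePoint { cp ->
        println("${codePointToString(cp)} $cp")
    }
}
```

For the string from the question, the last line printed should be 😀 128512, matching the desired output.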
