
In Kotlin, how can I iterate over a string that contains Unicode characters above U+FFFF?

Example code:

val s = "Hëllø! € 😀"
for (c in s) {
    println("$c ${c.code}")
}

Actual output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
� 55357
� 56832

Desired output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
😀 128512
  • The Iterator implementation shown in the duplicate target should already do what you want. My answer there goes one step further and combines all the combining marks into a single element too (which you might want to do as well). Commented Nov 25, 2024 at 16:26
  • @Sweeper your duplicate could be the answer the OP needs, but it's not the question he's asked. He is not asking about combining characters. His question can be simply answered thus: Replace for (c in s) with for (c in s.codePoints()). Commented Nov 25, 2024 at 16:59
  • @k314159 I’ll wait for OP to respond. If OP doesn’t want combining characters and ZWJ sequences to be combined, I’m happy to reopen. Commented Nov 25, 2024 at 17:23
  • @k314159 You offer half the solution. For a full solution I had to change the body of the loop to var i = IntArray(1); i[0] = c; println("${String(i, 0, 1)} $c"). This compiles with kotlinc, but with Kotlin/Native I get errors like unresolved reference 'codePoints'. Both using Kotlin version 2.0.21. Commented Nov 26, 2024 at 19:08
  • The "This question already has an answer here" link refers to a completely different issue. Commented Nov 26, 2024 at 19:11

1 Answer


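In Kotlin/JVM, strings are encoded in UTF-16. (To be precise, they may be encoded internally using Latin-1, but they still behave externally as if they're encoded in UTF-16.) This means that they're made up of 16-bit code units (Kotlin's Char), and a code point above U+FFFF is stored as a pair of surrogates. That is why iterating over the Chars prints 55357 and 56832 for 😀 instead of 128512. To get the actual Unicode code points, including those above U+FFFF, you can use Java's codePoints() method: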

val s = "Hëllø! € 😀"
for (cp in s.codePoints()) {
    println("${buildString { appendCodePoint(cp) }} $cp")
}

Output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
😀 128512

However, be aware of the presence of combining characters, where multiple Unicode code points are used to make a single grapheme. If you want to support combining characters, then my answer will not help: you will need to look at Sweeper's answer in this question instead.
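
To illustrate the difference between code points and graphemes, here is a minimal JVM-only sketch that uses java.text.BreakIterator to group code points into user-perceived characters. The helper name graphemes is made up for this example, and this is not necessarily the approach taken in Sweeper's answer. How well BreakIterator handles complex emoji sequences (ZWJ sequences, flags) depends on the JDK version, so treat it as a starting point only.

```kotlin
import java.text.BreakIterator

// Hypothetical helper (not from the answer above): splits a string into
// user-perceived characters using the JVM's character BreakIterator.
fun graphemes(s: String): List<String> {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(s)
    val result = mutableListOf<String>()
    var start = boundaries.first()
    var end = boundaries.next()
    while (end != BreakIterator.DONE) {
        result.add(s.substring(start, end))
        start = end
        end = boundaries.next()
    }
    return result
}

fun main() {
    // "e" followed by U+0301 (combining acute accent): two code points, one grapheme.
    val s = "e\u0301 😀"
    for (g in graphemes(s)) {
        // Print each grapheme followed by the code points it is made of.
        println("$g ${g.codePoints().toArray().joinToString(" ")}")
    }
}
```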

Unfortunately, on other platforms, Kotlin currently doesn't make it easy to handle Unicode. See this discussion for a list of currently open issues.
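
If codePoints() is not available on your target, one possible workaround is to decode the surrogate pairs by hand. The sketch below relies only on Char.isHighSurrogate(), Char.isLowSurrogate() and Char.code, which to my knowledge are part of the common standard library, so it should in principle also compile on Kotlin/Native, though I have only reasoned about it for the JVM. The names codePointSequence and codePointToString are invented for this example and are not standard library functions.

```kotlin
// Hypothetical helper: yields the Unicode code points of a UTF-16 string.
// A valid high/low surrogate pair becomes one code point; an unpaired
// surrogate is passed through unchanged as its own value.
fun String.codePointSequence(): Sequence<Int> {
    val s = this
    return sequence {
        var i = 0
        while (i < s.length) {
            val hi = s[i]
            if (hi.isHighSurrogate() && i + 1 < s.length && s[i + 1].isLowSurrogate()) {
                val lo = s[i + 1]
                // Standard UTF-16 decoding: 10 bits from each surrogate, plus 0x10000.
                yield(0x10000 + ((hi.code - 0xD800) shl 10) + (lo.code - 0xDC00))
                i += 2
            } else {
                yield(hi.code)
                i++
            }
        }
    }
}

// Hypothetical helper: turns a code point back into a String
// (the inverse of the decoding above).
fun codePointToString(cp: Int): String =
    if (cp < 0x10000) {
        cp.toChar().toString()
    } else {
        val offset = cp - 0x10000
        charArrayOf(
            (0xD800 + (offset shr 10)).toChar(),
            (0xDC00 + (offset and 0x3FF)).toChar()
        ).concatToString()
    }

fun main() {
    val s = "Hëllø! € 😀"
    for (cp in s.codePointSequence()) {
        println("${codePointToString(cp)} $cp")
    }
}
```

For the string in the question, the pair 55357 (0xD83D) and 56832 (0xDE00) decodes to 0x10000 + (0x3D shl 10) + 0x200 = 0x1F600, which is 128512.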
