How to get a char's unicode value?

Question

I want to get a Kanji's Unicode value. It might look something like let values: &[u16] = f("ののの");

When I use "の".as_bytes() I get [227, 129, 174], which are the UTF-8 bytes.

When I use 'の'.escape_unicode() I get \u{306e}; 0x306e is exactly what I want.

'の' as u16, hex encode. If you want to operate on an entire string and you're confident that it's all kanji, you can encode it as UTF-16.
– Ry- ♦, Commented Oct 21, 2018 at 20:28
...though of course if one is looking for code points then as u32 would be highly recommended. True, UTF-16 is good enough for Kanji today, but in general that encoding is just a mess. Many characters will fail to give the correct code point with u16.
– Ray Toal, Commented Oct 21, 2018 at 20:35
"の😎".chars().map(|ch| ch as u32).collect::<Vec<_>>(), though using .chars() directly should be sufficient in most cases. Note that the 😎 needs more than 16 bits.
– starblue, Commented Oct 22, 2018 at 12:34
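
For the &[u16] shape asked about in the question, the UTF-16 route mentioned in the comments could be sketched as follows. The helper name utf16_units is hypothetical; str::encode_utf16 is the standard library method. Note that the result holds UTF-16 code units, not code points, so a character outside the BMP shows up as a surrogate pair.

```rust
/// Encodes `s` as UTF-16 code units.
/// `utf16_units` is a hypothetical helper name, not a standard library function.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // BMP characters map to one unit each, equal to the code point.
    println!("{:x?}", utf16_units("ののの")); // [306e, 306e, 306e]
    // U+1F60E does not fit in 16 bits and becomes a surrogate pair.
    println!("{:x?}", utf16_units("の😎")); // [306e, d83d, de0e]
}
```

If real code points are needed for arbitrary text, .chars() with as u32 avoids the surrogate issue.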

loganfsmyth · Accepted Answer · 2021-02-18 02:12:08Z

24

The char type can be cast to u32 using as. The line

println!("{:x}", 'の' as u32);

will print "306e" (using {:x} to format the number as hex).
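
Applied to a whole string, as in the question, the same cast might be wrapped like this. It is a minimal sketch: code_points is a hypothetical helper name, and it returns u32 values rather than the u16 the question mentions.

```rust
/// Collects the Unicode code point of every char in `s`.
/// `code_points` is a hypothetical helper name, not a standard library function.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    let values = code_points("ののの");
    // Format each code point as lowercase hex.
    let hex: Vec<String> = values.iter().map(|v| format!("{:x}", v)).collect();
    println!("{}", hex.join(" ")); // prints "306e 306e 306e"
}
```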

If you are sure all your characters are in the BMP, you can in theory also cast directly to u16. For characters from supplementary planes this will silently give wrong results, though, e.g. '🝖' as u16 returns 0xf756 instead of the correct 0x1f756, so you need a strong reason to do this.
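
If a u16 is really required, one way to avoid the silent truncation is a checked conversion that fails for characters outside the BMP. This is only a sketch; bmp_code_point is a hypothetical helper name.

```rust
use std::convert::TryFrom;

/// Returns the code point as a u16 only if it fits, i.e. the char is in the BMP.
/// `bmp_code_point` is a hypothetical helper name.
fn bmp_code_point(c: char) -> Option<u16> {
    u16::try_from(c as u32).ok()
}

fn main() {
    // The plain cast keeps only the low 16 bits of U+1F756.
    println!("{:x}", '🝖' as u16); // f756
    // The checked conversion reports the problem instead.
    println!("{:?}", bmp_code_point('の')); // Some(12398)
    println!("{:?}", bmp_code_point('🝖')); // None
}
```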

Internally, a char is stored as a 32-bit number, so c as u32 for some character c only reinterprets the memory representation of the character as a u32.

edited Feb 18, 2021 at 2:12

loganfsmyth

answered Oct 21, 2018 at 20:28

Sven Marnach

5 Comments

Ray Toal Over a year ago

I'd go as far as to say "don't EVER use u16 at all!" It's just misleading and an unnecessary "optimization." But kudos for working out and showing that the as u16 silently drops off the higher-order 16 bits of the code point. That's good information to have and nicely researched. I'd suggest phrasing it more as "Don't do this" because you might know your characters are all in the BMP today, but tomorrow they might not be.

aurexav Over a year ago

Thank you. By the way, do you know how to get its Shift JIS value? Should I use a lookup table?

Sven Marnach Over a year ago

@RayToal I agree and changed the wording slightly.

Sven Marnach Over a year ago

@AurevoirXavier I just googled that for you – here you go: stackoverflow.com/questions/48136939/…

BallpointBen May 24 at 15:39

If you want at least four hex digits unconditionally, regardless of the value of the character, you can use {:0>4x}. For instance, non-breaking space will be 00a0 instead of a0.
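
A short sketch of the padding behaviour described in this comment. The helper name hex4 is hypothetical, and {:04x} is used here as an equivalent spelling of {:0>4x} for integers.

```rust
/// Formats a char's code point as at least four lowercase hex digits.
/// `hex4` is a hypothetical helper name.
fn hex4(c: char) -> String {
    format!("{:04x}", c as u32)
}

fn main() {
    let nbsp = '\u{a0}'; // non-breaking space
    println!("{:x}", nbsp as u32); // a0
    println!("{:0>4x}", nbsp as u32); // 00a0
    println!("{}", hex4(nbsp)); // 00a0
    // The width is a minimum, so longer values are not truncated.
    println!("{}", hex4('🝖')); // 1f756
}
```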
