I want to get a kanji's Unicode value. It might look something like let values: &[u16] = f("ののの");
When I use "の".as_bytes(), I get [227, 129, 174].
When I use 'の'.escape_unicode(), I get '\u306e', and 0x306e is exactly what I want.
The char type can be cast to u32 using as. The line
println!("{:x}", 'の' as u32); will print "306e" (using {:x} to format the number as hex).
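As a minimal sketch of the cast described above (the helper name `code_point` is only illustrative and not part of the standard library):

```rust
/// Returns the Unicode code point of `c` as a plain number.
/// `code_point` is a hypothetical helper name used for illustration.
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    let cp = code_point('の');
    println!("{:x}", cp); // prints "306e"
    println!("U+{:04X}", cp); // prints "U+306E"
}
```

The cast itself is all that is needed; wrapping it in a function just makes it easy to reuse in an iterator chain.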
If you are sure all your characters are in the BMP, you can in theory also cast directly to u16. For characters from supplementary planes this will silently give wrong results, though, e.g. '🝖' as u16 returns 0xf756 instead of the correct 0x1f756, so you need a strong reason to do this.
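The truncation can be seen directly, and if a u16 really is required, a checked conversion avoids the silent wrong result. The sketch below assumes a hypothetical helper `bmp_code_point` that returns None for anything outside the BMP:

```rust
use std::convert::TryFrom;

/// Returns the code point as a `u16` only if it fits in 16 bits,
/// i.e. only if the character is in the BMP.
/// `bmp_code_point` is a hypothetical helper name used for illustration.
fn bmp_code_point(c: char) -> Option<u16> {
    u16::try_from(c as u32).ok()
}

fn main() {
    let c = '\u{1F756}'; // a character from a supplementary plane

    // The plain cast keeps only the low 16 bits.
    println!("{:x}", c as u32); // prints "1f756"
    println!("{:x}", c as u16); // prints "f756"

    // The checked conversion reports the problem instead.
    println!("{:?}", bmp_code_point(c)); // prints "None"
    println!("{:?}", bmp_code_point('の')); // prints "Some(12398)"
}
```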
Internally, a char is stored as a 32-bit number, so c as u32 for some character c only reinterprets the memory representation of the character as a u32.
[...] u16 at all!" It's just misleading and an unnecessary "optimization." But kudos for working out and showing that as u16 silently drops the higher-order 16 bits of the code point. That's good information to have and nicely researched. I'd suggest phrasing it more as "Don't do this," because you might know your characters are all in the BMP today, but tomorrow they might not be.
To zero-pad the output to four hex digits, use {:0>4x}. For instance, non-breaking space will be 00a0 instead of a0.
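A small sketch of the {:0>4x} padding mentioned in the comments, assuming a hypothetical helper named `hex4`. The width is a minimum, so code points above U+FFFF are printed in full rather than cut off:

```rust
/// Formats the code point of `c` as lowercase hex, zero-padded to at
/// least four digits. `hex4` is a hypothetical helper name.
fn hex4(c: char) -> String {
    format!("{:0>4x}", c as u32)
}

fn main() {
    println!("{}", hex4('\u{a0}')); // non-breaking space, prints "00a0"
    println!("{}", hex4('の')); // prints "306e"
    println!("{}", hex4('\u{1F756}')); // prints "1f756"
}
```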
'の' as u16, hex encode. If you want to operate on an entire string and you're confident that it's all kanji, you can encode it as UTF-16.
as u32 would be highly recommended. True, UTF-16 is good enough for kanji today, but in general that encoding is just a mess. Many characters will fail to give the correct code point with u16.
[...] "の😎".chars().map(|ch| ch as u32).collect::<Vec<_>>(), though using .chars() directly should be sufficient in most cases. Note that the 😎 needs more than 16 bits.
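To compare the two whole-string approaches from the comments, here is a sketch with two hypothetical helpers, `code_points` and `utf16_units`. It shows that UTF-16 code units match the code points only for BMP characters; 😎 (U+1F60E) becomes a surrogate pair:

```rust
/// Collects the Unicode code points of `s`, one `u32` per character.
/// `code_points` is a hypothetical helper name.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|ch| ch as u32).collect()
}

/// Collects the UTF-16 code units of `s`. Characters outside the BMP
/// produce two units (a surrogate pair), not their code point.
/// `utf16_units` is a hypothetical helper name.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    let s = "の\u{1F60E}"; // "の" followed by U+1F60E

    println!("{:x?}", code_points(s)); // prints "[306e, 1f60e]"
    println!("{:x?}", utf16_units(s)); // prints "[306e, d83d, de0e]"

    // For text that stays in the BMP, both give the same numbers.
    println!("{:x?}", utf16_units("ののの")); // prints "[306e, 306e, 306e]"
}
```

If only the BMP case ever occurs, encode_utf16() yields the &[u16]-style values asked for in the question, but the u32 version stays correct when other characters show up.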