Timeline for C++ UTF-8 decoder

Current License: CC BY-SA 4.0

7 events
when what by license comment
Sep 12, 2023 at 7:16 comment added Toby Speight @Davislor, that's hideous - every day I find a new reason to be glad I never have to support Microsoft platforms!
Sep 9, 2023 at 17:45 comment added Davislor Note that wchar_t is only 16 bits wide on MSVC (even though that violates the Standard). The best type for the return value is char32_t.
Apr 27, 2023 at 17:22 comment added Tau mbrtowc and related functions (besides bringing with them significant performance overhead over the straightforward UTF-8 decoder) are extremely inadvisable simply because of their dependence on global locale. You had the foresight to try and set that in main(), but consider that 1. this is not possible when writing a library, 2. you might be forced to use a library that is itself stupidly locale-dependent and 3. "en_US.utf8" might not even exist on your target machine, in which case you're just completely hosed.
Jan 8, 2023 at 10:45 comment added Toby Speight Yes, that's true. It's not clear why the review code wants to deal with a codepoint at a time, rather than simply transforming an entire string to UCS-4. And of course, codepoints aren't always complete in themselves if combining characters are involved...
Jan 8, 2023 at 4:43 comment added Dwayne Robinson Calling locale functions or mbrtowc for every single character is a lot of overhead for a transformation that does not (and should not ever) rely on the current locale. Definitely agree with returning char32_t though.
Apr 6, 2021 at 14:31 vote accept KlemenPl
Apr 6, 2021 at 13:14 history answered Toby Speight CC BY-SA 4.0
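
The comments above converge on two points: decode without consulting the global locale, and return char32_t rather than wchar_t. A minimal sketch of that approach follows. It is not the code from the reviewed question or from the answer; the names decode_one and decode_all are hypothetical, and returning std::nullopt on malformed input is just one possible error policy (substituting U+FFFD would be another).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode one code point starting at s[pos].
// On success, advances pos past the sequence and returns the code point.
// On malformed or truncated input, returns std::nullopt and leaves pos unchanged.
std::optional<char32_t> decode_one(std::string_view s, std::size_t& pos)
{
    if (pos >= s.size()) return std::nullopt;

    const auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    };

    const unsigned char lead = byte(pos);
    std::size_t len = 0;
    char32_t cp = 0;

    // The lead byte determines the sequence length and the initial payload bits.
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return std::nullopt;       // stray continuation byte or invalid lead

    if (len > s.size() - pos) return std::nullopt;  // truncated sequence

    // Each continuation byte must look like 10xxxxxx and adds six bits.
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = byte(pos + i);
        if ((c & 0xC0) != 0x80) return std::nullopt;
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF.
    static constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) {
        return std::nullopt;
    }

    pos += len;
    return cp;
}

// Hypothetical helper: transform a whole UTF-8 string to UTF-32 in one pass,
// as suggested in the comments, instead of one call per character by the caller.
std::optional<std::u32string> decode_all(std::string_view s)
{
    std::u32string out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const auto cp = decode_one(s, pos);
        if (!cp) return std::nullopt;
        out.push_back(*cp);
    }
    return out;
}
```

Because nothing here touches setlocale or mbrtowc, the result does not depend on whether a locale such as "en_US.utf8" exists on the target machine, and the function is usable from library code. As noted in the comments, a single code point is still not necessarily a complete user-perceived character when combining characters are involved.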