Timeline for C++ UTF-8 decoder

Current License: CC BY-SA 4.0

7 events

when	what		by	license	comment
Sep 12, 2023 at 7:16	comment	added	Toby Speight		@Davislor, that's hideous - every day I find a new reason to be glad I never have to support Microsoft platforms!
Sep 9, 2023 at 17:45	comment	added	Davislor		Note that `wchar_t` is only 16 bits wide on MSVC (even though that violates the Standard). The best type for the return value is `char32_t`.
Apr 27, 2023 at 17:22	comment	added	Tau		`mbrtowc` and related functions (besides bringing with them significant performance overhead over the straightforward UTF-8 decoder) are extremely inadvisable simply because of their dependence on the global locale. You had the foresight to try to set that in main(), but consider that (1) this is not possible when writing a library, (2) you might be forced to use a library that is itself stupidly locale-dependent, and (3) "en_US.utf8" might not even exist on your target machine, in which case you're just completely hosed.
Jan 8, 2023 at 10:45	comment	added	Toby Speight		Yes, that's true. It's not clear why the review code wants to deal with a codepoint at a time, rather than simply transforming an entire string to UCS-4. And of course, codepoints aren't always complete in themselves if combining characters are involved...
Jan 8, 2023 at 4:43	comment	added	Dwayne Robinson		Calling `locale` or `mbrtowc` for every single character is a lot of overhead for a transformation that does not (and should not ever) rely on the current locale. Definitely agree with returning `char32_t` though.
Apr 6, 2021 at 14:31	vote	accept	KlemenPl
Apr 6, 2021 at 13:14	history	answered	Toby Speight	CC BY-SA 4.0
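
The comments above argue for a decoder that ignores the locale, returns `char32_t`, and converts a whole string at once rather than one codepoint per call. A minimal C++17 sketch of that idea follows. It is not the reviewed code or the answer's code: the function name `decode_utf8` and the policy of emitting U+FFFD for each malformed byte are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode a whole UTF-8 string to UTF-32 without
// consulting the locale. Each byte that cannot start or continue a valid
// sequence produces U+FFFD (one possible error policy, assumed here).
std::u32string decode_utf8(std::string_view in)
{
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // total bytes in this sequence
        char32_t cp = 0;       // codepoint being assembled
        char32_t min = 0;      // smallest value allowed for this length

        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else { out.push_back(replacement); ++i; continue; }  // stray continuation or invalid lead

        bool ok = i + len <= in.size();  // reject truncated sequences
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) { out.push_back(replacement); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A call such as `decode_utf8("\xE2\x82\xAC")` yields a one-element string holding U+20AC. As the Jan 8, 2023 comment notes, the result is a sequence of codepoints, not of user-perceived characters, so combining sequences still span several elements.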