Timeline for Which tools can find unicode (UTF-8) text in files containing any form of normalization?

6 events

when	what		by	license	comment
Dec 4, 2022 at 21:28	comment	added	Stéphane Chazelas		That's u with two combining characters. See also `u\u0304\u0308` and `u\u0308\u0304`, which actually have two different precomposed forms (`U+1E7B LATIN SMALL LETTER U WITH MACRON AND DIAERESIS` and `U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON`, respectively), while both `C\u0301\u0327` and `C\u0327\u0301` give `U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE` after canonical composition...
Dec 3, 2022 at 14:21	comment	added	Thomas Tempelmann		Since this is not a code-related site, I think explaining the method is more helpful here. I had only thought of adding the precomposed and decomposed spellings of the entire search string to the regex alternatives; I wasn't aware of your case. Is that just an underlined variant of the letter ü, or something else?
Dec 2, 2022 at 10:32	comment	added	Stéphane Chazelas		Unless you share your code, that answer is not going to be very useful to anyone. Are you also taking into account the order in which characters are combined? Like `ū̳` as `u U+0304 U+0333` vs `u U+0333 U+0304` vs `U+016B U+0333`?
Dec 2, 2022 at 10:10	history	edited	Thomas Tempelmann	CC BY-SA 4.0	added 191 characters in body
Dec 2, 2022 at 10:05	vote	accept	Thomas Tempelmann
Dec 2, 2022 at 10:05	history	answered	Thomas Tempelmann	CC BY-SA 4.0
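
The ordering cases raised in the comments can be checked with a short Python sketch. This is not the code from the answer (which was not shared); it only assumes Python's standard `unicodedata` module, and the helper name `contains_normalized` is made up for illustration. It compares strings after canonical decomposition (NFD), which reorders combining marks of different combining classes but leaves marks of the same class in their original order.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Canonical decomposition with canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


def contains_normalized(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test on the NFD forms of both strings."""
    return nfd(needle) in nfd(haystack)


# U+0304 (macron) and U+0308 (diaeresis) share combining class 230,
# so their order is significant and the two sequences are NOT equivalent.
macron_first = nfc("u\u0304\u0308")     # composes to U+1E7B
diaeresis_first = nfc("u\u0308\u0304")  # composes to U+01D6

# U+0327 (cedilla, class 202) and U+0301 (acute, class 230) differ in class,
# so both orders are canonically equivalent and compose to U+1E08.
acute_first = nfc("C\u0301\u0327")
cedilla_first = nfc("C\u0327\u0301")

# The three spellings from the earlier comment: U+0333 has class 220 and is
# sorted before U+0304 (class 230), so all three share one NFD form.
spellings = ["u\u0304\u0333", "u\u0333\u0304", "\u016b\u0333"]
decomposed = {nfd(s) for s in spellings}

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in macron_first + diaeresis_first])
    print([f"U+{ord(c):04X}" for c in acute_first + cedilla_first])
    print(len(decomposed), contains_normalized("x" + spellings[0] + "y", spellings[2]))
```

Normalizing both the search string and the text to the same form covers the reordering cases without enumerating regex alternatives, but sequences that differ only in the order of same-class marks (the first pair above) still compare as different, because Unicode does not treat them as canonically equivalent.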