Timeline for Which tools can find unicode (UTF-8) text in files containing any form of normalization?
Current License: CC BY-SA 4.0
6 events
| when | what | | by | license | comment |
|---|---|---|---|---|---|
| Dec 4, 2022 at 21:28 | comment | added | Stéphane Chazelas | | That's u with two combining characters. See also u\u0304\u0308 and u\u0308\u0304, which actually have two different precomposed forms (U+1E7B LATIN SMALL LETTER U WITH MACRON AND DIAERESIS and U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON, respectively), while both C\u301\u327 and C\u327\u301 give U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE after canonical composition... |
| Dec 3, 2022 at 14:21 | comment | added | Thomas Tempelmann | | Since this is not a code-related site, I think explaining the method is more helpful here. I only thought of adding precomposed and decomposed spellings of the entire search string to the regex alternatives, but your case isn't known to me. Is that just an underline option for the letter ü, or what is that? |
| Dec 2, 2022 at 10:32 | comment | added | Stéphane Chazelas | | Unless you share your code, that answer is not going to be very useful to anyone. Are you also taking into account the order in which characters are combined? Like ū̳ as u U+0304 U+0333 vs u U+0333 U+0304 vs U+016B U+0333? |
| Dec 2, 2022 at 10:10 | history | edited | Thomas Tempelmann | CC BY-SA 4.0 | added 191 characters in body |
| Dec 2, 2022 at 10:05 | vote | accept | Thomas Tempelmann | | |
| Dec 2, 2022 at 10:05 | history | answered | Thomas Tempelmann | CC BY-SA 4.0 | |
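
The ordering question raised in the comments can be checked with Python's standard `unicodedata` module. The following is only a minimal sketch, not the code from the answer (which was not shared): the helper names `canonically_equal` and `contains_normalized` are hypothetical, and the approach (normalize both sides to NFD before comparing) is an assumption about one way to make a search insensitive to normalization form and to the order of combining marks.

```python
import unicodedata


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent (hypothetical helper)."""
    return nfd(a) == nfd(b)


def contains_normalized(haystack: str, needle: str) -> bool:
    """Substring search after normalizing both sides to NFD (hypothetical helper).

    Caveat: a plain substring test can match part of a combining sequence,
    e.g. "u" is found inside decomposed "ü". A real tool would also check
    that the match ends on a grapheme boundary.
    """
    return nfd(needle) in nfd(haystack)


# Marks with different combining classes (acute = 230, cedilla = 202) are
# reordered by normalization, so both spellings are canonically equivalent.
c_acute_cedilla = "C\u0301\u0327"
c_cedilla_acute = "C\u0327\u0301"
print(canonically_equal(c_acute_cedilla, c_cedilla_acute))

# Marks with the same combining class (macron and diaeresis, both 230) keep
# their order, so these two spellings are NOT equivalent.
u_macron_diaeresis = "u\u0304\u0308"
u_diaeresis_macron = "u\u0308\u0304"
print(canonically_equal(u_macron_diaeresis, u_diaeresis_macron))

# The three spellings of u + macron + double low line from the comments.
variants = ["u\u0304\u0333", "u\u0333\u0304", "\u016b\u0333"]
print(all(canonically_equal(variants[0], v) for v in variants))

# Searching a text that stores the decomposed form with a precomposed needle.
print(contains_normalized("fu\u0308r", "f\u00fcr"))
```

Normalizing to NFD rather than enumerating every spelling as a regex alternative avoids having to list each permutation of combining marks by hand, at the cost of normalizing the file contents before matching.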