Timeline for Filtering invalid utf8
Current License: CC BY-SA 3.0
9 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Jun 30, 2015 at 20:27 | history | edited | Stéphane Chazelas | CC BY-SA 3.0 | simplified |
| May 11, 2015 at 17:10 | comment | added | septerr | This has been the best solution for finding and removing invalid UTF-8 characters for me. Thanks! | |
| Dec 23, 2014 at 2:11 | vote | accept | Gilles 'SO- stop being evil' | ||
| Dec 23, 2014 at 2:11 | comment | added | Gilles 'SO- stop being evil' | I'm switching the accepted answer to this one (sorry, Peter.O) because it's simple and works well for my primary use case, which is a heuristic to distinguish UTF-8 from other common encodings (especially 8-bit encodings). Stéphane Chazelas and Peter.O provide more accurate answers in terms of UTF-8 compliance. | |
| Dec 20, 2014 at 0:12 | comment | added | vinc17 | @StéphaneChazelas I confirm that POSIX says: "The input files shall be text files." (though not in the description part, which is a bit misleading). This also means that in case of invalid sequences, the behavior is undefined by POSIX. Hence the need to know the implementation, such as GNU grep (whose intent is to regard invalid sequences as non-matching), and possible bugs. | |
| Dec 19, 2014 at 23:50 | comment | added | Stéphane Chazelas | Sorry, my bad, the behaviour is unspecified in POSIX since grep is a text utility (only expected to work on text input), so I suppose GNU grep's behaviour is as valid as any here. | |
| Dec 19, 2014 at 23:37 | comment | added | vinc17 | @StéphaneChazelas Conversely, the -a is needed by GNU grep (which isn't POSIX compliant, I assume). Concerning the surrogate area and the codepoints above 0x10FFFF, this is a bug then (which could explain that). For this, adding -P should work with GNU grep 2.21 (but is slow); it is buggy at least in Debian grep/2.20-4. | |
| Dec 19, 2014 at 23:10 | comment | added | Stéphane Chazelas | Except for -a, that's required to work by POSIX. However GNU grep at least fails to spot the UTF-8 encoded UTF-16 surrogate non-characters or codepoints above 0x10FFFF. | |
| Dec 19, 2014 at 22:14 | history | answered | vinc17 | CC BY-SA 3.0 |
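
The comments above separate two concerns: a quick heuristic for telling UTF-8 from 8-bit encodings, and strict UTF-8 compliance, which also rejects UTF-8-encoded surrogates and code points above 0x10FFFF. As a rough illustration of the strict check (not the grep-based approach of the answer itself), here is a minimal Python sketch. The function names `is_valid_utf8` and `split_lines` are hypothetical, and it assumes the input is available as bytes and split on newline bytes.

```python
def is_valid_utf8(line: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        # Python's strict UTF-8 decoder rejects truncated sequences,
        # encoded surrogates (U+D800..U+DFFF) and values above U+10FFFF.
        line.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def split_lines(data: bytes):
    """Partition newline-separated input into (valid, invalid) lines."""
    valid, invalid = [], []
    for line in data.split(b"\n"):
        (valid if is_valid_utf8(line) else invalid).append(line)
    return valid, invalid


# Illustrative inputs only; these byte strings are assumptions for the demo.
SAMPLES = {
    "ascii": b"plain",
    "utf8": "caf\u00e9".encode("utf-8"),
    "latin1": b"caf\xe9",             # 8-bit encoding, not UTF-8
    "surrogate": b"\xed\xa0\x80",     # UTF-8-encoded U+D800
    "above_max": b"\xf4\x90\x80\x80", # would encode U+110000
}
```

Calling `split_lines(b"\n".join(SAMPLES.values()))` would place the first two samples in the valid list and the last three in the invalid list, which covers both the 8-bit-encoding heuristic and the surrogate and out-of-range cases raised in the comments.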