Timeline for Filtering invalid utf8

Current License: CC BY-SA 3.0

9 events
when what by license comment
Jun 30, 2015 at 20:27 history edited Stéphane Chazelas CC BY-SA 3.0
simplified
May 11, 2015 at 17:10 comment added septerr This has been the best solution for finding and removing invalid UTF-8 characters for me. Thanks!
Dec 23, 2014 at 2:11 vote accept Gilles 'SO- stop being evil'
Dec 23, 2014 at 2:11 comment added Gilles 'SO- stop being evil' I'm switching the accepted answer to this one (sorry, Peter.O) because it's simple and works well for my primary use case, which is a heuristic to distinguish UTF-8 from other common encodings (especially 8-bit encodings). Stéphane Chazelas and Peter.O provide more accurate answers in terms of UTF-8 compliance.
Dec 20, 2014 at 0:12 comment added vinc17 @StéphaneChazelas I confirm that POSIX says: "The input files shall be text files." (though not in the description part, which is a bit misleading). This also means that in case of invalid sequences, the behavior is undefined by POSIX. Hence the need to know the implementation, such as GNU grep (whose intent is to regard invalid sequences as non-matching), and possible bugs.
Dec 19, 2014 at 23:50 comment added Stéphane Chazelas Sorry, my bad, the behaviour is unspecified in POSIX since grep is a text utility (only expected to work on text input), so I suppose GNU grep's behaviour is as valid as any here.
Dec 19, 2014 at 23:37 comment added vinc17 @StéphaneChazelas Conversely, the -a is needed by GNU grep (which isn't POSIX compliant, I assume). Concerning the surrogate area and the codepoints above 0x10FFFF, this is a bug then (which could explain that). For this, adding -P should work with GNU grep 2.21 (but is slow); it is buggy at least in Debian grep/2.20-4.
Dec 19, 2014 at 23:10 comment added Stéphane Chazelas Except for -a, that's required to work by POSIX. However GNU grep at least fails to spot the UTF-8 encoded UTF-16 surrogate non-characters or codepoints above 0x10FFFF.
Dec 19, 2014 at 22:14 history answered vinc17 CC BY-SA 3.0
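
The edge cases raised in the comments above (UTF-8 encoded surrogates and code points above 0x10FFFF) can be checked independently of any particular grep version. The following is a minimal sketch in Python, not the method used in the answer: it assumes Python 3's built-in strict `utf-8` codec as the reference validator, and the helper name `is_strict_utf8` and the sample byte strings are made up for illustration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec.

    Hypothetical helper for illustration. The strict codec rejects
    encoded surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
    """
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Illustrative byte sequences (assumed test data, not from the thread).
samples = {
    "ASCII": b"hello",
    "two-byte sequence (U+00E9)": b"\xc3\xa9",
    "lone 8-bit byte (Latin-1 e-acute)": b"\xe9",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "highest valid code point U+10FFFF": b"\xf4\x8f\xbf\xbf",
    "first sequence above U+10FFFF": b"\xf4\x90\x80\x80",
}

for name, raw in samples.items():
    print(f"{name}: {is_strict_utf8(raw)}")
```

A check like this can serve as a cross-reference when deciding whether a given grep build treats the surrogate and out-of-range cases as invalid.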