Timeline for Filtering invalid utf8

Current License: CC BY-SA 3.0

9 events

when	what		by	license	comment
Jun 30, 2015 at 20:27	history	edited	Stéphane Chazelas	CC BY-SA 3.0	simplified
May 11, 2015 at 17:10	comment	added	septerr		This has been the best solution for finding and removing invalid UTF-8 characters for me. Thanks!
Dec 23, 2014 at 2:11	vote	accept	Gilles 'SO- stop being evil'
Dec 23, 2014 at 2:11	comment	added	Gilles 'SO- stop being evil'		I'm switching the accepted answer to this one (sorry, Peter.O) because it's simple and works well for my primary use case, which is a heuristic to distinguish UTF-8 from other common encodings (especially 8-bit encodings). Stéphane Chazelas and Peter.O provide more accurate answers in terms of UTF-8 compliance.
Dec 20, 2014 at 0:12	comment	added	vinc17		@StéphaneChazelas I confirm that POSIX says: "The input files shall be text files." (though not in the description part, which is a bit misleading). This also means that in case of invalid sequences, the behavior is undefined by POSIX. Hence the need to know the implementation, such as GNU `grep` (whose intent is to regard invalid sequences as non-matching), and possible bugs.
Dec 19, 2014 at 23:50	comment	added	Stéphane Chazelas		Sorry, my bad, the behaviour is unspecified in POSIX since `grep` is a text utility (only expected to work on text input), so I suppose GNU grep's behaviour is as valid as any here.
Dec 19, 2014 at 23:37	comment	added	vinc17		@StéphaneChazelas Conversely, the `-a` is needed by GNU `grep` (which isn't POSIX compliant, I assume). Concerning the surrogate area and the codepoints above 0x10FFFF, this is a bug then (which could explain that). For this, adding `-P` should work with GNU `grep` 2.21 (but is slow); it is buggy at least in Debian grep/2.20-4.
Dec 19, 2014 at 23:10	comment	added	Stéphane Chazelas		Except for `-a`, that's required to work by POSIX. However GNU `grep` at least fails to spot the UTF-8 encoded UTF-16 surrogate non-characters or codepoints above 0x10FFFF.
Dec 19, 2014 at 22:14	history	answered	vinc17	CC BY-SA 3.0
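
The comments above turn on which byte sequences count as invalid UTF-8: bytes from 8-bit encodings, UTF-8 encoded surrogates, and codepoints above 0x10FFFF. As a rough illustration only (this is not the `grep` command from the answer, and the helper names `is_valid_utf8` and `invalid_lines` are made up for this sketch), a strict decoder such as Python's rejects all three cases:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
    except UnicodeDecodeError:
        return False
    return True


def invalid_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that are not valid UTF-8.

    Splitting on the 0x0A byte is safe because that byte never occurs
    inside a multibyte UTF-8 sequence.
    """
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if not is_valid_utf8(line)
    ]


# Hypothetical sample inputs covering the cases raised in the comments.
samples = {
    "ascii": b"hello",
    "utf8": "Stéphane".encode("utf-8"),
    "latin1": "Stéphane".encode("latin-1"),  # lone 0xE9 byte from an 8-bit encoding
    "surrogate": b"\xed\xa0\x80",            # U+D800 encoded as if it were UTF-8
    "too_large": b"\xf4\x90\x80\x80",        # would decode to 0x110000, above 0x10FFFF
}

for name, raw in samples.items():
    print(name, is_valid_utf8(raw))
```

A per-line check like `invalid_lines` is enough for the heuristic mentioned above (telling UTF-8 apart from common 8-bit encodings); whether a given `grep` build rejects the surrogate and out-of-range cases depends on the implementation and version, as the comments note.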