I am trying to write a method that can detect the encoding of a text file. I know there are many encodings out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Does anyone know a way to do this?
2 Answers
First, open the file in binary mode and read it into memory.
For UTF-8 (or ASCII), do a validation check. You can decode the text using Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes) and catch the exception. If you don't get one, the data is valid UTF-8. Here is the code:
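    // Requires: using System.IO; using System.Text;
    private bool detectUTF8Encoding(string filename)
    {
        byte[] bytes = File.ReadAllBytes(filename);
        try
        {
            // Strict decoding: throws on any invalid UTF-8 byte sequence.
            Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

For UTF-16, check for the BOM at the start of the file: FE FF for big-endian or FF FE for little-endian.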
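The checks above can be combined into a single pass over the bytes. The following is only a minimal sketch, assuming the file really is one of the three encodings from the question. The names `DetectedEncoding`, `EncodingSniffer`, and `Detect` are made up for illustration, and UTF-16 text without a BOM is not recognised.

```csharp
using System;
using System.Text;

// Hypothetical result type, not part of .NET.
public enum DetectedEncoding { Ascii, Utf8, Utf16LittleEndian, Utf16BigEndian, Unknown }

public static class EncodingSniffer
{
    public static DetectedEncoding Detect(byte[] bytes)
    {
        // 1. UTF-16 is recognised by its BOM only (assumption: a BOM is present).
        if (bytes.Length >= 2)
        {
            if (bytes[0] == 0xFF && bytes[1] == 0xFE) return DetectedEncoding.Utf16LittleEndian;
            if (bytes[0] == 0xFE && bytes[1] == 0xFF) return DetectedEncoding.Utf16BigEndian;
        }

        // 2. Pure ASCII: every byte is below 0x80 (an empty file also lands here).
        if (Array.TrueForAll(bytes, b => b < 0x80)) return DetectedEncoding.Ascii;

        // 3. Strict UTF-8 validation: throwOnInvalidBytes makes GetString throw.
        try
        {
            new UTF8Encoding(false, true).GetString(bytes);
            return DetectedEncoding.Utf8;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return DetectedEncoding.Unknown;
        }
    }
}
```

A caller would typically pass in `File.ReadAllBytes(filename)`. The BOM test runs first because UTF-16 text usually fails strict UTF-8 validation, so checking the BOM up front avoids a wasted decoding attempt.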
3 Comments
Douglas
For UTF-8, you can also check for the BOM: EF BB BF. If present, this check would succeed much more quickly than decoding the text.
dan04
If present. It's not necessary for UTF-8, and often omitted, especially on Unix-like systems.
Douglas
Yes, that’s true. But since it’s a quick check to perform, it’s worth throwing in for the few times it succeeds.
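The quick BOM test discussed in these comments might look like the sketch below. `BomCheck` and `HasUtf8Bom` are hypothetical names. A `false` result only means there is no BOM, not that the data is something other than UTF-8, so the full validation is still needed as a fallback.

```csharp
public static class BomCheck
{
    // True when the data starts with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF
            && bytes[1] == 0xBB
            && bytes[2] == 0xBF;
    }
}
```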
Use the StreamReader to identify the encoding.
Example:
    using (var r = new StreamReader(filename, Encoding.Default))
    {
        richtextBox1.Text = r.ReadToEnd();
        var encoding = r.CurrentEncoding;
    }

3 Comments
dan04
You have to already know the encoding in order to use StreamReader.
Douglas
Dan W
This method will fall back to the user's local encoding if the file is not UTF-8, which could be desirable. However, it won't be able to detect UTF-8 if there's no BOM, even if the file is perfectly valid UTF-8 text.
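To illustrate the behaviour described in these comments, here is a small sketch of BOM-based detection with `StreamReader`. The class and method names (`StreamReaderProbe`, `DetectByBom`) are made up for illustration, and the fallback encoding is whatever the caller passes in. `CurrentEncoding` is only read after a first read, since the reader examines the BOM at that point.

```csharp
using System.IO;
using System.Text;

public static class StreamReaderProbe
{
    // Returns the encoding StreamReader settles on: the BOM's encoding if a
    // BOM is present, otherwise the supplied fallback (hypothetical helper).
    public static Encoding DetectByBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            reader.Read(); // the first read triggers BOM detection
            return reader.CurrentEncoding;
        }
    }
}
```

With a fallback of `Encoding.ASCII`, a stream that starts with EF BB BF reports UTF-8, while BOM-less UTF-8 bytes such as C3 A9 still report ASCII. That second case is the limitation noted above.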