I am trying to create a method that can detect the encoding scheme of a text file. I know there are many encodings out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16, so I only need to detect these three. Does anyone know a way to do this?

  • Do you know if they have a BOM (byte order mark)? If so, you can use that to determine the type. Commented May 9, 2012 at 19:11
  • You can safely ignore ASCII. Any valid ASCII file is always a valid UTF-8 file (assuming you’re using the correct 7-bit definition of ASCII). Commented May 9, 2012 at 19:17
  • You are SOL if there is no BOM. Commented May 9, 2012 at 19:19
  • @MikeCorcoran: Hardly. If you’re dealing with predominantly English text, then there are heuristics which give highly accurate results. For example, you can identify a UTF-16 file because most alternate bytes would be \0. Commented May 9, 2012 at 19:24
  • Unfortunately, I don't think there is a BOM. I just looked in a hex editor. Commented May 9, 2012 at 20:05

2 Answers

First, open the file in binary mode and read it into memory.

For UTF-8 (or ASCII), do a validation check. You can decode the text using Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes) and catch the exception. If you don't get one, the data is valid UTF-8. Here is the code:

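    // Requires: using System.IO; using System.Text;
    private bool detectUTF8Encoding(string filename)
    {
        byte[] bytes = File.ReadAllBytes(filename);
        try
        {
            // The exception fallbacks make the decoder throw on invalid bytes
            // instead of silently substituting replacement characters.
            Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Invalid byte sequence: the data is not valid UTF-8.
            return false;
        }
    }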

For UTF-16, check for the BOM (FE FF or FF FE, depending on byte order).
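
Since the question's comments mention that the file may have no BOM, the checks above can be combined with the zero-byte heuristic suggested there. The following is only a sketch under stated assumptions: `EncodingSniffer` and `Detect` are hypothetical names, the "more than half of the pairs" threshold is an arbitrary choice, and the heuristic only makes sense for predominantly English text limited to the three encodings in the question.

```csharp
using System;
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper (not from the answer above).
    // Returns "UTF-8", "UTF-16LE", "UTF-16BE", or null if nothing matched.
    // ASCII is reported as "UTF-8", since valid ASCII is valid UTF-8.
    public static string Detect(byte[] bytes)
    {
        // 1. BOM checks: EF BB BF (UTF-8), FF FE (UTF-16 LE), FE FF (UTF-16 BE).
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) return "UTF-8";
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) return "UTF-16LE";
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) return "UTF-16BE";

        // 2. Zero-byte heuristic: in UTF-16, English text has a zero byte in
        //    every other position (odd offsets for LE, even offsets for BE).
        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] != 0) continue;
            if (i % 2 == 0) evenZeros++; else oddZeros++;
        }
        int pairs = bytes.Length / 2;
        if (pairs > 0 && bytes.Length % 2 == 0)
        {
            // Assumed threshold: more than half of the 2-byte units look like ASCII.
            if (oddZeros * 2 > pairs && evenZeros == 0) return "UTF-16LE";
            if (evenZeros * 2 > pairs && oddZeros == 0) return "UTF-16BE";
        }

        // 3. Strict UTF-8 validation (throwOnInvalidBytes: true).
        try
        {
            new UTF8Encoding(false, true).GetString(bytes);
            return "UTF-8";
        }
        catch (DecoderFallbackException)
        {
            return null;
        }
    }
}
```

A call such as `EncodingSniffer.Detect(File.ReadAllBytes(filename))` would then give a best guess; a `null` result means the bytes matched none of the three expected encodings.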


3 Comments

For UTF-8, you can also check for the BOM: EF BB BF. If present, this check would succeed much more quickly than decoding the text.
If present. It's not necessary for UTF-8, and often omitted, especially on Unix-like systems.
Yes, that’s true. But since it’s a quick check to perform, it’s worth throwing in for the few times it succeeds.

Use a StreamReader to identify the encoding.

Example:

    using (var r = new StreamReader(filename, Encoding.Default))
    {
        richtextBox1.Text = r.ReadToEnd();
        var encoding = r.CurrentEncoding;
    }

3 Comments

You have to already know the encoding in order to use StreamReader.
This method will fall back to the user's local encoding if it's not UTF-8, which could be desirable. However, it won't be able to detect UTF-8 if there's no BOM, even if it's perfectly valid UTF-8 text.
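
To illustrate the limitation described in these comments, here is a minimal sketch of BOM-based detection with `StreamReader`. The class and method names (`ReaderDemo`, `DetectFromBom`) and the `fallback` parameter are hypothetical. Note that `CurrentEncoding` is only meaningful after the first read, because the reader does not inspect the BOM until then.

```csharp
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // Hypothetical helper: returns the encoding implied by the file's BOM,
    // or the supplied fallback encoding when no BOM is present.
    public static Encoding DetectFromBom(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // Detection happens on the first read, not in the constructor.
            reader.Read();
            return reader.CurrentEncoding;
        }
    }
}
```

For a BOM-less file this simply returns whatever `fallback` was passed in, so valid UTF-8 without a BOM is not recognized this way; a strict decoding check is still needed for that case.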
