Determining text file encoding schema

Question

I am trying to create a method that can detect the encoding schema of a text file. I know there are many out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Does anyone know a way to do this?

Do you know if they have a BOM (byte order mark)? If so, you can use that to determine the type.
– alexn, Commented May 9, 2012 at 19:11
You can safely ignore ASCII. Any valid ASCII file is always a valid UTF-8 file (assuming you’re using the correct 7-bit definition of ASCII).
– Douglas, Commented May 9, 2012 at 19:17
@MikeCorcoran: Hardly. If you’re dealing with predominantly English text, then there are heuristics which give highly accurate results. For example, you can identify a UTF-16 file because most alternate bytes would be \0.
– Douglas, Commented May 9, 2012 at 19:24
Unfortunately, I don't think there is a BOM. I just looked in a hex editor.
– Icemanind, Commented May 9, 2012 at 20:05
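Since the file apparently has no BOM, the null-byte heuristic Douglas mentions is worth spelling out. The following is only a rough sketch: the class name `Utf16Heuristic`, the method `Guess`, and the 0.4 threshold are illustrative assumptions, and the approach only works for text that is mostly in the Latin range.

```csharp
using System;

public static class Utf16Heuristic
{
    // Guesses whether BOM-less bytes are UTF-16 by counting zero bytes at
    // even and odd offsets. Returns "UTF-16LE", "UTF-16BE", or "unknown".
    // The default threshold is an arbitrary illustrative value, not a standard.
    public static string Guess(byte[] bytes, double threshold = 0.4)
    {
        if (bytes.Length < 2) return "unknown";

        int pairs = bytes.Length / 2;
        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i + 1 < bytes.Length; i += 2)
        {
            if (bytes[i] == 0) evenZeros++;
            if (bytes[i + 1] == 0) oddZeros++;
        }

        // For mostly-English text, UTF-16LE puts the zero byte second in
        // each pair, while UTF-16BE puts it first.
        bool manyOdd = oddZeros >= pairs * threshold;
        bool manyEven = evenZeros >= pairs * threshold;
        if (manyOdd && !manyEven) return "UTF-16LE";
        if (manyEven && !manyOdd) return "UTF-16BE";
        return "unknown";
    }
}
```

An "unknown" result would then be handed to a UTF-8/ASCII check. Text that is mostly outside the Latin range contains few zero bytes, so this heuristic will miss it.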

Dan W · Accepted Answer · 2012-10-11 17:45:44Z

First, open the file in binary mode and read it into memory.

For UTF-8 (or ASCII), do a validation check. You can decode the text using Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes) and catch the exception. If you don't get one, the data is valid UTF-8. Here is the code:

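    // Requires: using System.IO; using System.Text;
    private bool detectUTF8Encoding(string filename)
    {
        byte[] bytes = File.ReadAllBytes(filename);
        try
        {
            // Strict decoding: invalid byte sequences throw instead of being replaced.
            Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8. I/O errors are left to propagate.
            return false;
        }
    }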

For UTF-16, check for the BOM (FE FF or FF FE, depending on byte order).
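A minimal sketch of that BOM check, assuming the file contents are already in a byte array. The class name `Utf16Bom` and the string return values are made up for illustration.

```csharp
public static class Utf16Bom
{
    // Returns "UTF-16BE" for FE FF, "UTF-16LE" for FF FE, otherwise "none".
    // Caveat: FF FE 00 00 is also the UTF-32LE BOM. That is ignored here
    // because the question restricts the input to ASCII, UTF-8, or UTF-16.
    public static string Detect(byte[] bytes)
    {
        if (bytes.Length >= 2)
        {
            if (bytes[0] == 0xFE && bytes[1] == 0xFF) return "UTF-16BE";
            if (bytes[0] == 0xFF && bytes[1] == 0xFE) return "UTF-16LE";
        }
        return "none";
    }
}
```

For example, `Utf16Bom.Detect(File.ReadAllBytes(filename))` would classify a file, although reading only the first two bytes would be enough.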

For UTF-8, you can also check for the BOM: EF BB BF. If present, this check would succeed much more quickly than decoding the text.
If present. It's not necessary for UTF-8, and often omitted, especially on Unix-like systems.
Yes, that’s true. But since it’s a quick check to perform, it’s worth throwing in for the few times it succeeds.
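One way to combine the two UTF-8 checks discussed here, as a sketch only: try the cheap BOM test first, then fall back to strict decoding. The class name `Utf8Check` and the method `IsUtf8` are hypothetical. Note that the fast path trusts the BOM and does not validate the rest of the data.

```csharp
using System.Text;

public static class Utf8Check
{
    public static bool IsUtf8(byte[] bytes)
    {
        // Fast path: a UTF-8 BOM (EF BB BF) at the start of the data.
        // The remaining bytes are NOT validated in this case.
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return true;

        try
        {
            // Slow path: strict decoding that throws on invalid sequences.
            Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)
                    .GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Pure ASCII input passes the slow path as well, since any valid ASCII file is also valid UTF-8.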

animaonline · Accepted Answer · 2012-05-09 19:11:40Z

Use the StreamReader to identify the encoding.

Example:

    using (var r = new StreamReader(filename, Encoding.Default))
    {
        richtextBox1.Text = r.ReadToEnd();
        // CurrentEncoding is only meaningful after the first read.
        var encoding = r.CurrentEncoding;
    }

dan04 Over a year ago

You have to already know the encoding in order to use StreamReader.

Douglas Over a year ago

This answer is correct. “A StreamReader will try to automatically detect the encoding of a file if there's a BOM when trying to read.”

Dan W Over a year ago

This method will fall back to the user's local encoding if it's not UTF-8, which could be desirable. However, it won't detect UTF-8 if there's no BOM, even if the text is perfectly valid UTF-8.
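To make the behaviour described in these comments concrete, here is a small sketch. The helper `ReaderDetection.Detect` is a hypothetical name, and the fallback encoding is passed in explicitly rather than taken from `Encoding.Default`. With a BOM the reader switches encoding; without one it simply keeps the encoding it was given.

```csharp
using System.IO;
using System.Text;

public static class ReaderDetection
{
    // Returns the encoding StreamReader settles on for the file at 'path'.
    // If the file has no BOM, this is just 'fallback'.
    public static Encoding Detect(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // Detection only happens once data has actually been read.
            reader.Read();
            return reader.CurrentEncoding;
        }
    }
}
```

A BOM-less UTF-8 file therefore comes back as the fallback encoding, which is why this approach cannot replace the validation check on its own.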
