How to remove BOM from byte array

Question

I have XML data in byte[] byteArray which may or may not contain a BOM. Is there a standard way in C# to remove the BOM from it? If not, what is the best way to do so that handles all cases, including all encodings?

Actually, I am fixing a bug in the code and I don't want to change much of the code. So it would be better if someone can give me the code to remove BOM.

I know that I could find the first 60 (the ASCII value of '<') and ignore the bytes before it, but I don't want to do that.

Can the data be either UTF-8 (with or without a byte-order mark) or UTF-16 (with or without a BOM; little-endian or big-endian)?
– Jeppe Stig Nielsen, Commented Mar 18, 2013 at 12:04
I have edited your title. Please see "Should questions include “tags” in their titles?", where the consensus is "no, they should not".
– John Saunders, Commented Mar 18, 2013 at 13:07

Rich O'Kelly · Accepted Answer · 2013-03-18 13:05:16Z

10

All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

    using (var stream = new MemoryStream(bytes))
    {
        var document = XDocument.Load(stream);
        ...
    }

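Once you have an XDocument you can then use it to emit the bytes without the BOM:

    // The settings object exposed by an existing XmlWriter is read-only,
    // so the encoding has to be chosen before the writer is created.
    var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
    using (var stream = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(stream, settings))
        {
            document.WriteTo(writer);
        } // disposing the writer flushes it into the stream
        var bytesWithoutBOM = stream.ToArray();
    }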

edited Mar 18, 2013 at 13:05

answered Mar 18, 2013 at 11:53


Ravi Gupta Over a year ago

Actually, I only want to remove the BOM; I don't need to care about parsing. I have updated the question as well.

Rich O'Kelly Over a year ago

@RaviGupta I see, do you know the encoding?

Ravi Gupta Over a year ago

It would be better if the logic were encoding-agnostic.

Rich O'Kelly Over a year ago

@RaviGupta Answer updated. There may be a more efficient way, perhaps looking at the internals of XmlReader to see how they detect the BOM, however what I have written above should work fine.

Ravi Gupta Over a year ago

Can we do it for all encodings? For example, instead of writer.Settings.Encoding = new UTF8Encoding(false); can we do writer.Settings.Encoding = new Encoding ... something like that?


Ross Jones · Accepted Answer · 2013-05-01 09:43:53Z

You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to cover further encodings; however, as far as I know, the UTF encodings are the only ones that use a BOM (I could well be wrong about that, though).

I got the info on the encoding types from here

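    using System.IO;
    using System.Linq;

    using (var stream = File.OpenRead("path_to_file"))
    {
        // Skip past the BOM, if any, before reading the content.
        stream.Position = Bom.GetCursor(stream);
    }

    public static class Bom
    {
        // Returns the length of the BOM at the start of the stream (0 if none).
        // The UTF-32 checks come first because the UTF-32 little-endian BOM
        // begins with the same two bytes as the UTF-16 little-endian BOM.
        public static int GetCursor(Stream stream)
        {
            // UTF-32, big-endian
            if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF }))
                return 4;
            // UTF-32, little-endian
            if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 }))
                return 4;
            // UTF-16, big-endian
            if (IsMatch(stream, new byte[] { 0xFE, 0xFF }))
                return 2;
            // UTF-16, little-endian
            if (IsMatch(stream, new byte[] { 0xFF, 0xFE }))
                return 2;
            // UTF-8
            if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF }))
                return 3;
            return 0;
        }

        private static bool IsMatch(Stream stream, byte[] match)
        {
            stream.Position = 0;
            var buffer = new byte[match.Length];
            var read = stream.Read(buffer, 0, buffer.Length);
            // A short read is treated as "no match"; otherwise the zero padding
            // in the buffer could be mistaken for the tail of a UTF-32 BOM.
            return read == match.Length && buffer.SequenceEqual(match);
        }
    }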

prueba prueba · Accepted Answer · 2019-02-17 01:52:44Z

You don't have to worry about BOM.

If for some reason you need to use an XmlDocument object, maybe this code can help you:

    byte[] file_content = {wherever you get it};
    XmlDocument xml = new XmlDocument();
    xml.Load(new MemoryStream(file_content));

It worked for me when I tried to download an XML attachment from a Gmail account using the Google API: the file had a BOM, and using Encoding.UTF8.GetString(file_content) didn't work "properly".
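A likely reason GetString seemed not to work "properly" is that Encoding.UTF8.GetString decodes the BOM bytes as an ordinary character (U+FEFF) rather than dropping them, so the resulting string does not start with '<'. The sketch below illustrates this; the class and method names are hypothetical and chosen only for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8BomDemo
{
    // Decodes UTF-8 bytes and drops a leading U+FEFF if the decoder left one in.
    public static string DecodeWithoutBom(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }
}
```

Loading the bytes through a stream, as in the answer above, avoids the problem because the XML reader handles the BOM itself.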

Shiroy · Accepted Answer · 2021-06-16 12:55:21Z

What you can also do is use a StreamReader.

Assuming you have a MemoryStream ms

    using (StreamReader sr = new StreamReader(new MemoryStream(ms.ToArray()), Encoding.UTF8))
    {
        var bytesWithoutBOM = new UTF8Encoding(false).GetBytes(sr.ReadToEnd());
        var stringWithoutBOM = Convert.ToBase64String(bytesWithoutBOM);
    }
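If the output should stay in whatever encoding the BOM indicated, rather than always being re-encoded as UTF-8, a variation on the same idea is to let StreamReader detect the encoding and then re-encode with reader.CurrentEncoding. This is only a sketch: the class and method names are hypothetical, it assumes the input is valid text in the detected encoding, and data without a BOM is treated as UTF-8, so BOM-less UTF-16 would be misread.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StreamReaderBomStripper
{
    public static byte[] Strip(byte[] input)
    {
        // The third argument turns on BOM detection; UTF-8 is the fallback
        // used when no BOM is present.
        using (var reader = new StreamReader(new MemoryStream(input), Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects the detected encoding once data has been read.
            // Encoding.GetBytes does not write a preamble, so no BOM is re-added.
            return reader.CurrentEncoding.GetBytes(text);
        }
    }
}
```

Because this decodes and re-encodes the whole payload, it is heavier than skipping a few bytes, and any invalid byte sequences would be replaced during decoding.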

Jim Mischel · Accepted Answer · 2013-03-18 13:13:38Z

You'll have to identify the byte order marks at the beginning of the byte array. There are several different combinations, as described at http://www.unicode.org/faq/utf_bom.html#bom1.

Just create a little state machine that starts at the beginning of the byte array and looks for those sequences.

I don't know how your array is used or what other parameters you use with it, so I can't really say how you'd "remove" the sequence. Your options appear to be:

If you have start and count parameters, you can just change those to reflect the starting point of the array (beyond the BOM).
If you just have a count parameter (other than the array's Length property), you can move data in the array to overwrite the BOM, and adjust the count accordingly.
If you don't have start or count parameters, then you'll want to create a new array that's the size of the old array minus the BOM, and copy the data into the new array.

To "remove" the sequence, you'd probably want to identify the mark if it's there and then copy the remaining bytes to a new byte array. Or, if you maintain a count of characters (other than the array's Length property), you can shift the data and adjust the count, as in the second option above.
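As a rough sketch of the third option (copying into a new array), the following checks the start of the array against the BOM byte sequences listed in the Unicode FAQ and copies the remainder. The class and method names are hypothetical, and the table covers only the UTF-8, UTF-16 and UTF-32 marks.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class BomStripper
{
    // The UTF-32 LE mark (FF FE 00 00) must be tested before the
    // UTF-16 LE mark (FF FE), because the former starts with the latter.
    private static readonly byte[][] Boms =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32, big-endian
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32, little-endian
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16, big-endian
        new byte[] { 0xFF, 0xFE },             // UTF-16, little-endian
    };

    // Returns the number of BOM bytes at the start of data (0 if none).
    public static int GetBomLength(byte[] data)
    {
        foreach (var bom in Boms)
        {
            if (StartsWith(data, bom)) return bom.Length;
        }
        return 0;
    }

    // Returns the input unchanged if it has no BOM, otherwise a copy without it.
    public static byte[] RemoveBom(byte[] data)
    {
        int offset = GetBomLength(data);
        if (offset == 0) return data;

        var result = new byte[data.Length - offset];
        Buffer.BlockCopy(data, offset, result, 0, result.Length);
        return result;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If the calling code already takes start and count parameters, GetBomLength alone is enough and no copy is needed. One caveat: UTF-16 little-endian text whose first character is U+0000 would be mistaken for a UTF-32 BOM, which should not arise for XML since it starts with '<' or whitespace.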
