10

I have XML data in byte[] byteArray which may or may not contain a BOM. Is there any standard way in C# to remove the BOM from it? If not, what is the best way to do it that handles all cases, including all types of encoding?

Actually, I am fixing a bug in the code and I don't want to change much of the code. So it would be better if someone can give me the code to remove BOM.

I know that I could look for byte 60, the ASCII value of '<', and ignore all bytes before it, but I don't want to do that.

2
  • Can the data be either UTF-8 (with or without byte-order mark) or UTF-16 (with or without BOM; little-endian or big-endian)? Commented Mar 18, 2013 at 12:04
  • I have edited your title. Please see, "Should questions include “tags” in their titles?", where the consensus is "no, they should not". Commented Mar 18, 2013 at 13:07

5 Answers

10

All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

    using (var stream = new MemoryStream(bytes))
    {
        var document = XDocument.Load(stream);
        ...
    }

Once you have an XDocument you can then use it to emit the bytes without the BOM:

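    // The settings of an existing XmlWriter are read-only, so set the encoding up front.
    var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
    byte[] bytesWithoutBOM;
    using (var stream = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(stream, settings))
        {
            document.WriteTo(writer);
        } // disposing the writer flushes it into the stream
        bytesWithoutBOM = stream.ToArray();
    }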

6 Comments

Actually, I only want to remove the BOM and don't care about parsing at all. I have updated the question as well.
@RaviGupta I see, do you know the encoding?
It would be better if the logic were encoding-independent.
@RaviGupta Answer updated. There may be a more efficient way, perhaps looking at the internals of XmlReader to see how they detect the BOM, however what I have written above should work fine.
Can we do it for all encodings? For example, instead of writer.Settings.Encoding = new UTF8Encoding(false); can we do writer.Settings.Encoding = new Encoding ... something like that?
3

You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to cover further encodings; as far as I know, the Unicode encodings (UTF-8, UTF-16, UTF-32) are the only ones that use a BOM, but I could well be wrong about that.

I got the info on the encoding types from here

    using System.IO;
    using System.Linq;

    using (var stream = File.OpenRead("path_to_file"))
    {
        // Skip past the BOM, if any, before reading.
        stream.Position = Bom.GetCursor(stream);
    }

    public static class Bom
    {
        // Returns the BOM length in bytes, or 0 if no known BOM is found.
        public static int GetCursor(Stream stream)
        {
            // UTF-32 is checked first because its little-endian BOM
            // starts with the UTF-16 little-endian BOM.
            // UTF-32, big-endian
            if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF })) return 4;
            // UTF-32, little-endian
            if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 })) return 4;
            // UTF-16, big-endian
            if (IsMatch(stream, new byte[] { 0xFE, 0xFF })) return 2;
            // UTF-16, little-endian
            if (IsMatch(stream, new byte[] { 0xFF, 0xFE })) return 2;
            // UTF-8
            if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF })) return 3;
            return 0;
        }

        private static bool IsMatch(Stream stream, byte[] match)
        {
            stream.Position = 0;
            var buffer = new byte[match.Length];
            int read = stream.Read(buffer, 0, buffer.Length);
            // A short read must not match: the zero padding in the buffer
            // could otherwise make FF FE look like the UTF-32 LE BOM.
            return read == match.Length && buffer.SequenceEqual(match);
        }
    }


3

You don't have to worry about BOM.

If for some reason you need to use an XmlDocument object, maybe this code can help you:

    byte[] file_content = {wherever you get it};
    XmlDocument xml = new XmlDocument();
    xml.Load(new MemoryStream(file_content));

It worked for me when I downloaded an XML attachment from a Gmail account using the Google API: the file had a BOM, and Encoding.UTF8.GetString(file_content) didn't work "properly".
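A likely reason for the "improper" result is that Encoding.UTF8.GetString decodes the BOM bytes like any other character, so the string starts with U+FEFF, whereas stream-based readers skip it. The sketch below only illustrates that difference; the class and method names are made up for the example.

```csharp
using System.IO;
using System.Text;

public static class BomDecodingDemo
{
    // Decodes the bytes directly. A leading UTF-8 BOM (EF BB BF) is not
    // stripped, so the result starts with the character U+FEFF.
    public static string DecodeRaw(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Decodes through a StreamReader, which recognizes a leading BOM
    // and leaves it out of the returned text.
    public static string DecodeWithReader(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With a UTF-8 input that starts with a BOM, DecodeRaw(data)[0] is '\uFEFF', while DecodeWithReader(data) begins directly with the document text.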


2

What you can also do is use a StreamReader.

Assuming you have a MemoryStream ms

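    using (StreamReader sr = new StreamReader(new MemoryStream(ms.ToArray()), Encoding.UTF8))
    {
        // The reader skips a leading BOM; re-encode the text as UTF-8 without one.
        var bytesWithoutBOM = new UTF8Encoding(false).GetBytes(sr.ReadToEnd());
        // Base64 text form of the BOM-free bytes.
        var stringWithoutBOM = Convert.ToBase64String(bytesWithoutBOM);
    }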
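The snippet above always produces UTF-8 output. If the original encoding should be kept, one possible variation is to let the StreamReader detect the encoding from the BOM and re-encode with StreamReader.CurrentEncoding, since Encoding.GetBytes never writes a preamble. This is only a sketch: BomReencoder and RemoveBomKeepingEncoding are hypothetical names, input without a BOM is assumed to be UTF-8, and invalid byte sequences would be replaced during the decode/encode round trip.

```csharp
using System.IO;
using System.Text;

public static class BomReencoder
{
    // Decodes the data using the encoding indicated by its BOM (falling back
    // to UTF-8 when there is none) and re-encodes it without a BOM.
    public static byte[] RemoveBomKeepingEncoding(byte[] data)
    {
        using (var reader = new StreamReader(
            new MemoryStream(data), new UTF8Encoding(false), true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects the detected encoding only after a read.
            // GetBytes encodes the characters without emitting a preamble.
            return reader.CurrentEncoding.GetBytes(text);
        }
    }
}
```

Because this goes through a string, it costs more than slicing the byte array, so it is mainly useful when the decoded text is needed anyway.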


0

You'll have to identify the byte order marks at the beginning of the byte array. There are several different combinations, as described at http://www.unicode.org/faq/utf_bom.html#bom1.

Just create a little state machine that starts at the beginning of the byte array and looks for those sequences.

I don't know how your array is used or what other parameters you use with it, so I can't really say how you'd "remove" the sequence. Your options appear to be:

  1. If you have start and count parameters, you can just change those to reflect the starting point of the array (beyond the BOM).
  2. If you just have a count parameter (other than the array's Length property), you can move data in the array to overwrite the BOM, and adjust the count accordingly.
  3. If you don't have start or count parameters, then you'll want to create a new array that's the size of the old array minus the BOM, and copy the data into the new array.

In most cases, "removing" the sequence means identifying the mark if it's there and then copying the remaining bytes to a new byte array.
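The third option can be sketched as follows. This is a minimal illustration, not a complete detector: it covers only the UTF-8, UTF-16 and UTF-32 marks, and BomStripper, GetBomLength and RemoveBom are names invented for the example. Note that FF FE 00 00 is treated as UTF-32 little-endian; UTF-16 little-endian data whose first character is U+0000 would look the same, but that cannot happen in well-formed XML.

```csharp
using System;

public static class BomStripper
{
    // Known BOMs. The UTF-32 LE mark (FF FE 00 00) must be tried before
    // the UTF-16 LE mark (FF FE), which is a prefix of it.
    private static readonly byte[][] Boms =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32, big-endian
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32, little-endian
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16, big-endian
        new byte[] { 0xFF, 0xFE },             // UTF-16, little-endian
    };

    // Returns the length of the BOM at the start of data, or 0 if none matches.
    public static int GetBomLength(byte[] data)
    {
        foreach (var bom in Boms)
        {
            if (StartsWith(data, bom)) return bom.Length;
        }
        return 0;
    }

    // Returns a copy of data without its leading BOM, or the original
    // array unchanged when no BOM is present.
    public static byte[] RemoveBom(byte[] data)
    {
        int skip = GetBomLength(data);
        if (skip == 0) return data;

        var result = new byte[data.Length - skip];
        Buffer.BlockCopy(data, skip, result, 0, result.Length);
        return result;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Usage would be a single call such as byteArray = BomStripper.RemoveBom(byteArray); callers that already pass start and count parameters could use GetBomLength alone and avoid the copy.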
