60

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

14 Answers

78

I recently ran into this after the .NET 4 upgrade. Up to and including .NET 3.5, the simple answer is

String.Trim()

which removes the BOM.

However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'}); 

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'}); 

This you could also use to remove other unwanted characters.

Some further information is from String.Trim Method:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
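
As a minimal sketch of the version-independent approach described above: passing the characters explicitly means the result does not depend on which characters the parameterless Trim treats as white space on a given framework version. The class and helper names (`BomTrimSketch`, `StripInvisibleLeading`) are hypothetical and not from the answer, and the sketch uses TrimStart rather than Trim on the assumption that only leading characters need to go.

```csharp
using System;

public static class BomTrimSketch
{
    // U+FEFF (BOM / ZERO WIDTH NO-BREAK SPACE) and U+200B (ZERO WIDTH SPACE).
    private static readonly char[] Invisible = { '\uFEFF', '\u200B' };

    // Hypothetical helper: removes only leading BOM / zero-width-space characters.
    public static string StripInvisibleLeading(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.TrimStart(Invisible);
    }

    public static void Main()
    {
        string xml = "\uFEFF<xml/>";
        Console.WriteLine(StripInvisibleLeading(xml).Length); // 6
    }
}
```

Note that this only helps when the BOM has been decoded to the single character U+FEFF; it does nothing for the three-character form shown in the question.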

6 Comments

Sorry, your example does not appear to work. Try it with string "\x00EF\x00BB\x00BF<xml/>" under .NET 4.
I didn't completely understand the question. I've had trouble with the standard BOM and didn't even recognise the \x00EF\x00BB\x00BF madness you had to deal with.
Isn't '\uFEFF' the BOM for UTF16, rather than UTF8?
You know, you're right there. I've never had trouble with the UTF-8 BOM (which, on reflection, is what the question asked about; that is indeed the UTF-8 one). The UTF-16 BOM is what I was having trouble with at the time.
@Cocowalla The corresponding bytes are FEFF in big-endian UTF16, yes, but the preamble character is the same in all encodings.
56

I had some incorrect test data, which caused me some confusion. Based on "How to avoid tripping over UTF-8 BOM when reading files", I found that this worked:

private readonly string _byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;
    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

8 Comments

Does not seem to work for me. Even "".StartsWith(_byteOrderMarkUtf8) returns true
@pingo Just tried your code in LINQPad 4 and it returned False.
Surprisingly, there's an implementation difference in the StartsWith method that produces different results on different operating systems. See stackoverflow.com/questions/19495318/…
@TrueWill, yes. Otherwise, the results are different when run on Windows 7 vs. Windows 8 or Windows Server 2012 for example.
This is the only approach that worked for me. I used string.Replace() to replace the BOM. Thanks
34

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

4 Comments

Looks simple to me, solved my problem and I think it will solve for other encodings too
Hi Vivek, could you visit the Tridion StackExchange proposal when you have a minute please? area51.stackexchange.com/proposals/38335/tridion We believe the commitment score requires visits from time to time and so is not including you in "users with > 200 rep" figure. Thanks!
this code deserves to be put in a frame, WTF! typical from my consulting days... Please rather use @PJUK solution
I had an invisible crap character at the beginning of my string and end, so I had to do the code presented here as well as something similar to the end of the string: int closingBracket = result.LastIndexOf('>'); if (result.Length > closingBracket + 1) result = result.Remove(closingBracket + 1);
27

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
    string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
        p = p.Remove(0, BOMMarkUtf8.Length);
    return p.Replace("\0", "");
}

How to use it:

string yourCleanString = RemoveBom(yourBOMString);

Note that StringComparison.Ordinal is important: depending on the culture the thread is running under, StartsWith can treat the BOM as an empty string, in which case it always returns true. Ordinal compares the strings using binary sort rules.
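
To make the note above concrete, here is a small sketch contrasting the two overloads. The names `OrdinalBomSketch` and `StripBom` are hypothetical. No output is claimed for the culture-sensitive call, since its result depends on the runtime and operating system.

```csharp
using System;

public static class OrdinalBomSketch
{
    private const string Bom = "\uFEFF";

    // Hypothetical helper: strips one leading U+FEFF using an ordinal comparison.
    public static string StripBom(string text)
    {
        return text.StartsWith(Bom, StringComparison.Ordinal)
            ? text.Substring(Bom.Length)
            : text;
    }

    public static void Main()
    {
        // Ordinal compares UTF-16 code units, so a string without a BOM never matches.
        Console.WriteLine("<xml/>".StartsWith(Bom, StringComparison.Ordinal)); // False

        // Culture-sensitive comparison may treat U+FEFF as ignorable, so this
        // can print True even though the string contains no BOM at all.
        Console.WriteLine("<xml/>".StartsWith(Bom));

        Console.WriteLine(StripBom("\uFEFF<xml/>")); // <xml/>
    }
}
```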

3 Comments

In my case I needed to strip a UTF-16 BOM. Changing 'Encoding.UTF8' to 'Encoding.Unicode' in the method worked for me.
This is effectively the same as @TrueWill 's answer.
It's not @MatthewDresser. It's smaller, simpler and clean. :)
22

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
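
A short sketch of the first point, assuming the three characters in the question came from decoding the UTF-8 bytes with a single-byte encoding (ISO-8859-1 is used here purely for illustration; the encoding actually applied by the asker's WebClient is not known). `BomDecodingSketch` and `DecodedLength` are hypothetical names.

```csharp
using System;
using System.Text;

public static class BomDecodingSketch
{
    // Decodes the UTF-8 preamble (EF BB BF) with the given encoding
    // and reports how many characters come out.
    public static int DecodedLength(Encoding encoding)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        return encoding.GetString(bom).Length;
    }

    public static void Main()
    {
        // Decoded as UTF-8, the three bytes become the single character U+FEFF.
        Console.WriteLine(DecodedLength(Encoding.UTF8)); // 1

        // Decoded with a single-byte encoding, each byte becomes its own
        // character: U+00EF, U+00BB, U+00BF.
        Console.WriteLine(DecodedLength(Encoding.GetEncoding("ISO-8859-1"))); // 3
    }
}
```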

4 Comments

XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.
I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.
OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.
good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.
12

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

4 Comments

This worked great for me but I had to add an intermediary StreamReader, i.e. var doc = XDocument.Load(new StreamReader(new MemoryStream(batchfile)));
Me too, Steven's code doesn't compile. There is no overload of XDocument.Load() that takes a Stream.
Here is the documentation for the XDocument.Load(Stream) overload: msdn.microsoft.com/en-us/library/cc838349.aspx. I guess it's specific to .NET 4, so you must be using .NET 3.5. In that case you would have to use a different overload.
8

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
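
The linked post is no longer available, so which constructor the author used is not known. As a hedged sketch of the idea, the following assumes the StreamReader(Stream, Encoding, bool) constructor with BOM detection enabled; `StreamReaderBomSketch` and `ReadText` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderBomSketch
{
    // Hypothetical helper: decodes bytes through a StreamReader, which
    // consumes a leading preamble instead of returning it as text.
    public static string ReadText(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // UTF-8 BOM followed by "<a/>".
        byte[] data = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };
        Console.WriteLine(ReadText(data)); // <a/>
    }
}
```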

1 Comment

That link is dead. Please avoid writing answers that only link to external resources. Include the link and the relevant sections.
5

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.

Usage:

string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail

/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;

public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}

1 Comment

Worked as expected.
5

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

3 Comments

Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.
Interesting. I was going to mark this down because I'd been using UTF8Encoding.ASCII.GetString(bytes) which leaves the BOM in but Encoding.UTF8.GetString(bytes) removes it. Upvoted instead
In my tests, both Encoding.UTF8.GetString(byte[] s) and new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetString(byte[] s) do not trim BOM.
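
A small check in line with the last comment, offered as a sketch and not as a statement about every runtime version: Encoding.GetString decodes the preamble bytes like any other bytes, so the result still starts with U+FEFF. `GetStringBomCheck` and `KeepsBom` are hypothetical names.

```csharp
using System;
using System.Text;

public static class GetStringBomCheck
{
    // Returns true when the decoded text still starts with U+FEFF.
    public static bool KeepsBom(Encoding encoding, byte[] data)
    {
        string text = encoding.GetString(data);
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    public static void Main()
    {
        // UTF-8 BOM followed by "<a/>".
        byte[] data = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };
        Console.WriteLine(KeepsBom(Encoding.UTF8, data));           // True
        Console.WriteLine(KeepsBom(new UTF8Encoding(false), data)); // True
    }
}
```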
3

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input
    // (nulls, thisArray.Length < otherArray.Length, etc.)
    if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
    {
        return false;
    }

    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }

    return true;
}

1 Comment

I don't see anything restricting the concept here to UTF-8. Since GetPreamble() belongs to Encoding, it should be possible to genericize to take in the Encoding as a parameter.
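
One way the generalization suggested in this comment might look, as a sketch only: the encoding is passed in and its own preamble is skipped when present. `PreambleSketch` and `GetStringWithoutPreamble` are hypothetical names, not part of the answer.

```csharp
using System;
using System.Text;

public static class PreambleSketch
{
    // Hypothetical generalization: decodes data with the given encoding,
    // skipping that encoding's preamble (BOM) when the data starts with it.
    public static string GetStringWithoutPreamble(byte[] data, Encoding encoding)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        byte[] preamble = encoding.GetPreamble();
        int offset = StartsWith(data, preamble) ? preamble.Length : 0;
        return encoding.GetString(data, offset, data.Length - offset);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        // An empty preamble (e.g. ASCII) means there is nothing to skip.
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        // UTF-16 little-endian: preamble FF FE, then '<' encoded as 3C 00.
        byte[] utf16 = { 0xFF, 0xFE, 0x3C, 0x00 };
        Console.WriteLine(GetStringWithoutPreamble(utf16, Encoding.Unicode)); // <
    }
}
```

The caller still has to know the encoding up front; this sketch does not try to detect it from the bytes.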
2
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

2 Comments

How does this solve the problem? Can you expand upon it at all?
StreamReader() will handle the BOM.
1

Yet another generic variation to get rid of the UTF-8 BOM preamble:

var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

0

Use a regex replace to filter out any characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:

certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", ""); 

And there you go. Voila!! It worked for me.

-1

I solved the issue with the following code

using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    // XDocument.Load(Stream) detects and skips the BOM itself.
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}
