Strip the byte order mark from string in C#

Question

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Peter Mortensen · Accepted Answer · 2022-02-18 20:38:15Z

I recently ran into this with the .NET 4 upgrade. Up to and including .NET 3.5, the simple answer is that

String.Trim()

removes the BOM.

However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'});

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'});

This you could also use to remove other unwanted characters.

Some further information is from String.Trim Method:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
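
As a minimal sketch of the explicit-character form of Trim described above, written as a C# 9+ top-level program: the helper name TrimInvisible and the sample string are made up for illustration and do not come from the answer.

```csharp
using System;

// Hypothetical helper: removes U+FEFF and U+200B from both ends of the string.
// Only the characters listed here are trimmed; ordinary white space is left alone.
static string TrimInvisible(string s) => s.Trim('\uFEFF', '\u200B');

string withBom = "\uFEFF<xml/>";          // sample input with a leading BOM character
string cleaned = TrimInvisible(withBom);

Console.WriteLine(withBom.Length);        // 7
Console.WriteLine(cleaned.Length);        // 6
Console.WriteLine(cleaned);               // <xml/>
```

Passing the characters explicitly makes the result independent of which characters the framework version counts as white space.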

Sorry, your example does not appear to work. Try it with string "\x00EF\x00BB\x00BF<xml/>" under .NET 4.
I didn't completely understand the question. I've had trouble with the standard BOM and didn't even recognise the \x00EF\x00BB\x00BF madness you had to deal with.
You know, you're right there. I've never had trouble with the UTF-8 BOM (which, on reflection, is what the question asked about); the UTF-16 BOM is what I was having trouble with at the time.
@Cocowalla The corresponding bytes are FEFF in big-endian UTF16, yes, but the preamble character is the same in all encodings.

Community · Accepted Answer · 2017-05-23 12:18:29Z

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:

private readonly string _byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

Does not seem to work for me. Even "".StartsWith(_byteOrderMarkUtf8) returns true
@pingo Just tried your code in LINQPad 4 and it returned False.
Surprisingly, there's an implementation difference in the StartsWith method that produces different results on different operating systems. See stackoverflow.com/questions/19495318/…
@TrueWill, yes. Otherwise, the results are different when run on Windows 7 vs. Windows 8 or Windows Server 2012 for example.
This is the only approach that worked for me. I used string.Replace() to replace the BOM. Thanks

Vivek Ayer · Accepted Answer · 2010-07-19 16:22:54Z

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

Davi Fiamenghi Over a year ago

Looks simple to me, solved my problem and I think it will solve for other encodings too

Rob Stevenson-Leggett Over a year ago

Hi Vivek, could you visit the Tridion StackExchange proposal when you have a minute please? area51.stackexchange.com/proposals/38335/tridion We believe the commitment score requires visits from time to time and so is not including you in "users with > 200 rep" figure. Thanks!

knocte Over a year ago

this code deserves to be put in a frame, WTF! typical from my consulting days... Please rather use @PJUK solution

John Gilmer Over a year ago

I had an invisible crap character at the beginning of my string and end, so I had to do the code presented here as well as something similar to the end of the string: int closingBracket = result.LastIndexOf('>'); if (result.Length > closingBracket + 1) result = result.Remove(closingBracket + 1);

ProgrammingLlama · Accepted Answer · 2023-06-20 02:14:44Z

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
    string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
        p = p.Remove(0, BOMMarkUtf8.Length);
    return p.Replace("\0", "");
}

How to use it:

string yourCleanString = RemoveBom(yourBOMString);

Note that StringComparison.Ordinal is important: depending on the culture the thread is running under, StartsWith can treat the BOM as an empty string, in which case it always returns true. Ordinal compares the strings using binary sort rules.
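
To illustrate the ordinal check on its own, here is a small sketch written as a C# 9+ top-level program. The helper name StartsWithBom is hypothetical. The culture-sensitive overload is deliberately not shown, since its result varies by platform.

```csharp
using System;

// Hypothetical helper: an ordinal comparison looks only at the UTF-16 code units,
// so U+FEFF is never treated as an ignorable (empty) prefix.
static bool StartsWithBom(string s) => s.StartsWith("\uFEFF", StringComparison.Ordinal);

Console.WriteLine(StartsWithBom("\uFEFF<xml/>")); // True
Console.WriteLine(StartsWithBom("<xml/>"));       // False
Console.WriteLine(StartsWithBom(""));             // False
```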

In my case I needed to strip a UTF-16 BOM. Changing 'Encoding.UTF8' to 'Encoding.Unicode' in the method worked for me.
It's not @MatthewDresser. It's smaller, simpler and clean. :)

Peter Mortensen · Accepted Answer · 2022-02-21 01:41:47Z

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

Answered Aug 23, 2009 at 4:48 by Martin v. Löwis; edited Feb 21, 2022 at 1:41 by Peter Mortensen.

TrueWill Over a year ago

XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.

Martin v. Löwis Over a year ago

I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.

TrueWill Over a year ago

OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.

Steven Oxley Over a year ago

good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.

Peter Mortensen · Accepted Answer · 2022-02-18 20:34:54Z

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

This worked great for me but I had to add an intermediary StreamReader
ie. var doc = XDocument.Load(new StreamReader(new MemoryStream(batchfile)));
Me too, Steven's code doesn't compile. There is no overload of XDocument.Load() that takes a Stream.
Here is the documentation for the XDocument.Load(Stream) overload: msdn.microsoft.com/en-us/library/cc838349.aspx. I guess it's specific to .NET 4, so you must be using .NET 3.5. In that case you would have to use a different overload.

Andrew Thompson · Accepted Answer · 2011-02-20 21:02:24Z

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
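
Since the linked post is no longer available, the following is only a sketch of what the StreamReader approach might look like, written as a C# 9+ top-level program. It assumes the StreamReader(Stream, Encoding, bool) constructor; the answer does not say which constructor was actually used, and the helper name DecodeSkippingBom and the sample bytes are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: decodes bytes through a StreamReader, which consumes
// a leading byte order mark instead of returning it as part of the text.
static string DecodeSkippingBom(byte[] data)
{
    using (var stream = new MemoryStream(data))
    using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
    {
        return reader.ReadToEnd();
    }
}

// Sample input: UTF-8 BOM (EF BB BF) followed by "<a/>".
byte[] sample = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };
string text = DecodeSkippingBom(sample);

Console.WriteLine(text.Length); // 4
Console.WriteLine(text);        // <a/>
```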

That link is dead. Please avoid writing answers that only link to external resources. Include the link and the relevant sections

Nicholas Petersen · Accepted Answer · 2019-04-10 23:25:43Z

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.

Usage:

string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail

/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;

public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}

Peter Mortensen · Accepted Answer · 2022-02-18 20:32:12Z

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.
Interesting. I was going to mark this down because I'd been using UTF8Encoding.ASCII.GetString(bytes) which leaves the BOM in but Encoding.UTF8.GetString(bytes) removes it. Upvoted instead
In my tests, both Encoding.UTF8.GetString(byte[] s) and new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetString(byte[] s) do not trim BOM.
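
The behaviour reported in these comments can be checked with a short sketch like the one below (a C# 9+ top-level program; the sample bytes are made up). GetString decodes the three BOM bytes into the single character U+FEFF and leaves it in the result.

```csharp
using System;
using System.Text;

// Sample input: UTF-8 BOM (EF BB BF) followed by "<a/>".
byte[] data = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };

string decoded = Encoding.UTF8.GetString(data);
Console.WriteLine(decoded.Length);              // 5
Console.WriteLine(decoded[0] == '\uFEFF');      // True

// The constructor argument only controls whether GetPreamble() returns a BOM;
// it does not make GetString drop one from the input.
string decodedNoBomEncoding = new UTF8Encoding(false).GetString(data);
Console.WriteLine(decodedNoBomEncoding[0] == '\uFEFF'); // True
```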

ProgrammingLlama · Accepted Answer · 2023-06-20 02:08:09Z

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input
    // (nulls, thisArray.Length < otherArray.Length, etc.)
    if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
    {
        return false;
    }

    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }

    return true;
}

I don't see anything restricting the concept here to UTF-8. Since GetPreamble() belongs to Encoding, it should be possible to genericize to take in the Encoding as a parameter.
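
A rough sketch of that generalization, written as a C# 9+ top-level program, might look like this. The name DecodeWithoutPreamble and the sample bytes are hypothetical and not part of the original answer.

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes data with the given encoding, skipping that
// encoding's preamble (BOM) if the data starts with it.
static string DecodeWithoutPreamble(byte[] data, Encoding encoding)
{
    byte[] preamble = encoding.GetPreamble();

    // Some encodings have an empty preamble; in that case there is nothing to skip.
    bool hasPreamble = preamble.Length > 0 && data.Length >= preamble.Length;
    for (int i = 0; hasPreamble && i < preamble.Length; i++)
    {
        if (data[i] != preamble[i])
        {
            hasPreamble = false;
        }
    }

    int offset = hasPreamble ? preamble.Length : 0;
    return encoding.GetString(data, offset, data.Length - offset);
}

// Sample input: UTF-16 little-endian BOM (FF FE) followed by "<a".
byte[] utf16 = { 0xFF, 0xFE, (byte)'<', 0x00, (byte)'a', 0x00 };
Console.WriteLine(DecodeWithoutPreamble(utf16, Encoding.Unicode)); // <a
```

Taking the Encoding as a parameter keeps the preamble check and the decoding in step, so the same method can cover UTF-8, UTF-16 and UTF-32 input.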

siva.k · Accepted Answer · 2014-08-28 13:48:43Z

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

Answered Aug 28, 2014 at 13:42 by lucasjam; edited Aug 28, 2014 at 13:48 by siva.k.

siva.k Over a year ago

How does this solve the problem? Can you expand upon it at all?

Mike S Over a year ago

StreamReader() will handle the BOM.

Vinicius · Accepted Answer · 2019-08-28 19:07:12Z

Yet another generic variation to get rid of the UTF-8 BOM preamble:

var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

Peter Mortensen · Accepted Answer · 2022-02-18 20:46:26Z

Use a regex replace to filter out every character other than the alphanumeric characters and spaces that a normal certificate thumbprint value contains:

certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");

And there you go. Voila!! It worked for me.

Oleg Polezky · Accepted Answer · 2019-11-09 09:46:08Z

I solved the issue with the following code

using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;

    // XDocument.Load detects and skips the BOM when reading from a stream.
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}
