My content contains multiple BOM (EF BB BF) characters in the middle of strings, and I simply want to remove them all.

The data containing the BOMs comes from a JavaScript source, which I POST to the backend. For now, it is saved as is, but this causes errors in post-processing when the characters are interpreted and start showing up mid-content. I suspect they come from something that was copy-pasted into my editor.

I can step through the string char by char, but I don't know how to compare against the BOM. Would it be possible to compare the hex values of the string's bytes against the three-byte sequence?

2 Answers

The UTF-8 BOM bytes (EF BB BF) get decoded into a single character, \ufeff, the Unicode "zero width no-break space". You can't see it and you can't hear it. Filter them all out with:

 var good = bad.Replace("\ufeff", ""); 
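
Since the data originates in JavaScript, the same filter could also be applied there before the POST. The following is a minimal sketch, assuming a runtime with a global `TextDecoder` (modern browsers or Node.js); `stripBom` is a hypothetical helper name, not something from the question. It also shows that the three bytes EF BB BF in the middle of a payload decode to a single character.

```javascript
// Hypothetical helper: removes every U+FEFF (BOM / zero width no-break space)
// from a string, wherever it occurs.
function stripBom(text) {
  return text.replace(/\uFEFF/g, "");
}

// The bytes for "a", then EF BB BF, then "b".
const payloadBytes = new Uint8Array([0x61, 0xef, 0xbb, 0xbf, 0x62]);

// Only a BOM at the very start is dropped by the decoder; one in the
// middle of the data survives as the single character U+FEFF.
const decoded = new TextDecoder("utf-8").decode(payloadBytes);

console.log(decoded.length);    // 3
console.log(stripBom(decoded)); // ab
```

Comparing against `\ufeff` works on the decoded string, so there is no need to inspect the raw bytes three at a time.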

3 Comments

Great success! One question though: might this cause problems by removing other bytes that get translated into the same Unicode character? I doubt that I'll miss any if they get removed, but are there other important or noteworthy characters like this?
You can't see them, you can't hear them.
To replace every occurrence in the string, use: const goodStr = badStr.split('\ufeff').join('');
Try the following:

CleanString = DirtyString.Replace("\u00EF\u00BB\u00BF", null); 
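
Note that this pattern matches the three separate characters U+00EF, U+00BB, U+00BF. That is how the BOM bytes appear when text is decoded one byte per character (Latin-1 style); a string decoded as UTF-8 contains the single character U+FEFF instead. A small sketch of the difference, written in JavaScript because that is where the data originates, with illustrative variable names and assuming a global `TextDecoder`:

```javascript
const bomBytes = [0xef, 0xbb, 0xbf];

// Decoded as UTF-8: one character, U+FEFF.
// ignoreBOM: true keeps a leading BOM in the output instead of dropping it.
const asUtf8 = new TextDecoder("utf-8", { ignoreBOM: true })
  .decode(new Uint8Array(bomBytes));

// Decoded one byte per character (Latin-1 style): three characters.
const asLatin1 = String.fromCharCode(...bomBytes);

console.log(asUtf8.length);   // 1
console.log(asLatin1.length); // 3
```

Which of the two patterns matches therefore depends on how the string was decoded on the way in.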

2 Comments

The way I tested this was to do string s2 = s.Replace(...) and then Debug.WriteLine(s2);. Then I copy-pasted the output from my output window to Notepad++ and switched to view HEX: I still see the BOM. Did I try it wrong?
That's how it is working for me. Maybe you find this helpful.
