Best way to encode text data for XML

Question

I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also contain, on occasion, the properly escaped entities &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing? :

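    Imports System.Text.RegularExpressions

    Function EncodeForXml(ByVal data As String) As String
        ' Matches ampersands that do not already start an entity or a decimal character reference.
        Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

        data = badAmpersand.Replace(data, "&amp;")

        Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
    End Function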

Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.
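For readers who prefer C#, here is a minimal sketch of the same idea as an extension method. It assumes a C# 3.0 or later compiler; the class name `XmlStringExtensions` is made up for illustration, and the regex lives in a static field because C# has no static locals. Without extension-method support, drop the `this` modifier and call it as an ordinary static method.

```csharp
using System.Text.RegularExpressions;

public static class XmlStringExtensions
{
    // Same pattern as the VB.NET version in the question: an ampersand that
    // does not already start something that looks like a named entity or a
    // decimal character reference.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)", RegexOptions.Compiled);

    // Hypothetical extension method; the name mirrors the question's function.
    public static string EncodeForXml(this string data)
    {
        if (string.IsNullOrEmpty(data)) return data;

        // Escape stray ampersands first so the later replacements
        // do not get their own ampersands escaped again.
        data = BadAmpersand.Replace(data, "&amp;");

        return data.Replace("<", "&lt;")
                   .Replace("\"", "&quot;")
                   .Replace(">", "&gt;");
    }
}
```

With this in scope, `"a & b &amp; <c>".EncodeForXml()` should return `a &amp; b &amp; &lt;c&gt;`: the bare ampersand is escaped and the existing `&amp;` is left alone.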

Update: The first few responses indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that gets better performance than doing a .Replace() on immutable strings in serial.
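One way the "single pass instead of serial .Replace() calls" idea could look is sketched below. The helper name `XmlEscaper.Escape` is an assumption for illustration, not a framework API. Note the trade-off: unlike the regex approach, this version escapes every ampersand, so input that already contains entities such as `&amp;` would be double-encoded.

```csharp
using System.Text;

public static class XmlEscaper
{
    // Hypothetical helper: escapes the five predefined XML entities in a
    // single pass, appending to one StringBuilder instead of allocating a
    // new string for each Replace() call.
    public static string Escape(string data)
    {
        if (string.IsNullOrEmpty(data)) return data;

        var sb = new StringBuilder(data.Length + 16);
        foreach (char c in data)
        {
            switch (c)
            {
                case '&':  sb.Append("&amp;");  break;
                case '<':  sb.Append("&lt;");   break;
                case '>':  sb.Append("&gt;");   break;
                case '"':  sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default:   sb.Append(c);        break;
            }
        }
        return sb.ToString();
    }
}
```

This sketch does nothing about characters that are illegal in XML; it only covers the predefined entities.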

Community · Accepted Answer · 2017-05-23 12:26:37Z

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.
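As a small illustration of the first point, the sketch below wraps SecurityElement.Escape in a hypothetical `EscapeDemo.EscapeText` helper (the wrapper name is made up; only the framework call is real). It shows the markup characters being escaped, while a control character that XML 1.0 does not allow passes through untouched.

```csharp
using System.Security;

public static class EscapeDemo
{
    // Thin wrapper, for illustration only. SecurityElement.Escape replaces
    // <, >, ", ' and & with their entities and leaves everything else as is.
    public static string EscapeText(string s)
    {
        return SecurityElement.Escape(s);
    }
}
```

For example, `EscapeDemo.EscapeText("a<b & \"c\"")` should give `a&lt;b &amp; &quot;c&quot;`, whereas a string containing U+0001 comes back with that character still in it, so the result is not necessarily well-formed XML text.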

Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    /// <summary>
    /// Encodes data so that it can be safely embedded as text in XML documents.
    /// </summary>
    public class XmlTextEncoder : TextReader {
        public static string Encode(string s) {
            using (var stream = new StringReader(s))
            using (var encoder = new XmlTextEncoder(stream)) {
                return encoder.ReadToEnd();
            }
        }

        /// <param name="source">The data to be encoded in UTF-16 format.</param>
        /// <param name="filterIllegalChars">It is illegal to encode certain
        /// characters in XML. If true, silently omit these characters from the
        /// output; if false, throw an error when encountered.</param>
        public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
            _source = source;
            _filterIllegalChars = filterIllegalChars;
        }

        readonly Queue<char> _buf = new Queue<char>();
        readonly bool _filterIllegalChars;
        readonly TextReader _source;

        public override int Peek() {
            PopulateBuffer();
            if (_buf.Count == 0) return -1;
            return _buf.Peek();
        }

        public override int Read() {
            PopulateBuffer();
            if (_buf.Count == 0) return -1;
            return _buf.Dequeue();
        }

        void PopulateBuffer() {
            const int endSentinel = -1;
            while (_buf.Count == 0 && _source.Peek() != endSentinel) {
                // Strings in .NET are assumed to be UTF-16 encoded [1].
                var c = (char) _source.Read();
                if (Entities.ContainsKey(c)) {
                    // Encode all entities defined in the XML spec [2].
                    foreach (var i in Entities[c]) _buf.Enqueue(i);
                } else if (!(0x0 <= c && c <= 0x8) &&
                           !new[] { 0xB, 0xC }.Contains(c) &&
                           !(0xE <= c && c <= 0x1F) &&
                           !(0x7F <= c && c <= 0x84) &&
                           !(0x86 <= c && c <= 0x9F) &&
                           !(0xD800 <= c && c <= 0xDFFF) &&
                           !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                    // Allow if the Unicode codepoint is legal in XML [3].
                    _buf.Enqueue(c);
                } else if (char.IsHighSurrogate(c) &&
                           _source.Peek() != endSentinel &&
                           char.IsLowSurrogate((char) _source.Peek())) {
                    // Allow well-formed surrogate pairs [1].
                    _buf.Enqueue(c);
                    _buf.Enqueue((char) _source.Read());
                } else if (!_filterIllegalChars) {
                    // Note that we cannot encode illegal characters as entity
                    // references due to the "Legal Character" constraint of
                    // XML [4]. Nor are they allowed in CDATA sections [5].
                    throw new ArgumentException(
                        String.Format("Illegal character: '{0:X}'", (int) c));
                }
            }
        }

        static readonly Dictionary<char,string> Entities =
            new Dictionary<char,string> {
                { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
                { '<', "&lt;" }, { '>', "&gt;" },
            };

        // References:
        // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
        // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
        // [3] http://www.w3.org/TR/xml11/#charsets
        // [4] http://www.w3.org/TR/xml11/#sec-references
        // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
    }

Unit tests and full code can be found here.

Good answer, have seen the similar solution from this article: seattlesoftware.wordpress.com/2008/09/11/…
For the bit (0x100000 <= c && c <= 0x10FFFF) my compiler warns me: "Comparison to integral constant is useless; the constant is outside the range of type 'char'"
Thanks codeulike — pointing out the warning was the kick I needed to finally rewrite the original, buggy code. =) Please try the new code if you get a chance.
+1 for updating your code :) and revisiting the question (helped me out)

Brad C · Accepted Answer · 2017-02-21 21:42:52Z

35

SecurityElement.Escape

documented here

edited Feb 21, 2017 at 21:42

Brad C


answered Oct 1, 2008 at 13:47

workmad3


2 Comments

Joel Coehoorn Over a year ago

This seems like what I'm looking for, but there are some comments at the bottom indicating the implementation is less than stellar.

drzaus Over a year ago

link dead

mklement0 · Accepted Answer · 2022-06-08 18:13:01Z

In the past I have used HttpUtility.HtmlEncode to encode text for xml. It performs the same task, really. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You've probably already read it, but here is an article on xml encoding and decoding.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, then return the string (.ToString()) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method as well, but haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0

Note that neither System.Xml.Linq.XText instances nor the System.Security.SecurityElement.Escape() nor the (made for HTML) System.Web.HttpUtility.HtmlEncode() methods handle translating illegal chars into character references (e.g., &#x1B; for ESC). By contrast, System.Xml.XmlDocument instances do.

Jeffrey Knight · Accepted Answer · 2015-11-09 22:44:48Z

Microsoft's ~~AntiXss library~~ AntiXssEncoder Class in System.Web.dll has methods for this:

    AntiXss.XmlEncode(string s)
    AntiXss.XmlAttributeEncode(string s)

It has HTML as well:

    AntiXss.HtmlEncode(string s)
    AntiXss.HtmlAttributeEncode(string s)

Community · Accepted Answer · 2017-05-23 10:31:35Z

~~In .net 3.5+~~

new XText("I <want> to & encode this for XML").ToString();

Gives you:

~~I &lt;want&gt; to &amp; encode this for XML~~

Turns out that this method doesn't encode some things that it should (like quotes).

SecurityElement.Escape (workmad3's answer) seems to do a better job with this and it's included in earlier versions of .net.

If you don't mind 3rd party code and want to ensure no illegal characters make it into your XML, I would recommend Michael Kropat's answer.

& isn't valid XML. I would assume it would use the XML entity: &
It seems that the easiest solution is the best sometimes. Saved me a large chunk of time, mucho appreciated.
@Armstrongest, &amp; is valid XML - see en.wikipedia.org/wiki/…. Ronnie: System.Xml.Linq.XText correctly does not escape " and ', because XML doesn't require it. However, like SecurityElement.Escape, it also doesn't handle translating illegal chars into character references. By contrast, System.Xml.XmlDocument does.

GSerg · Accepted Answer · 2015-07-30 19:01:44Z

5

XmlTextWriter.WriteString() does the escaping.
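A minimal sketch of what that looks like in practice, assuming the output goes to an in-memory StringWriter. The `WriterDemo.WriteElement` helper is a made-up name for illustration; the writer calls themselves are the standard System.Xml ones.

```csharp
using System.IO;
using System.Xml;

public static class WriterDemo
{
    // Hypothetical helper: writes one element whose text content is escaped
    // by the writer rather than by hand.
    public static string WriteElement(string name, string text)
    {
        using (var sw = new StringWriter())
        {
            using (var writer = new XmlTextWriter(sw))
            {
                writer.WriteStartElement(name);
                writer.WriteString(text);   // the writer escapes <, > and & here
                writer.WriteEndElement();
            }
            return sw.ToString();
        }
    }
}
```

Calling `WriterDemo.WriteElement("name", "a < b & c")` should yield `<name>a &lt; b &amp; c</name>`.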

edited Jul 30, 2015 at 19:01

answered Oct 1, 2008 at 13:48

GSerg


1 Comment

ddotsenko Over a year ago

Or use its relative on the XmlNode object: the .InnerText getter and setter decode and encode.

MusiGenesis · Accepted Answer · 2008-10-04 04:03:33Z

4

System.XML handles the encoding for you, so you don't need a method like this.

edited Oct 4, 2008 at 4:03

answered Oct 1, 2008 at 13:46

MusiGenesis


17 Comments

Sekhat Over a year ago

Or go shout at the guys who aren't encoding their xml correctly.

Michael Over a year ago

@Sekhat That's an unreasonable solution. In the real world, large data vendors often cannot be bothered to fix these types of issues, as doing so would break their clients' data.

Michael Over a year ago

@TrevorSullivan That approach works reasonably well in academia, but not so much elsewhere. If you only knew how half-baked some of the financial world's implementations of common specs are (ranging from CRC implementations to things as trivial as XML - I'm speaking from my first hand experience only), you might decide to keep your money in a mattress at home.

MusiGenesis Over a year ago

@Mick: if you knew how mattresses were made today, you might decide to take your money back to the bank.

Don Cheadle Over a year ago

This was accepted? It's not an answer. Sometimes we have to work with code that is using XML strings


Kev · Accepted Answer · 2008-10-01 13:46:07Z

3

If this is an ASP.NET app, why not use Server.HtmlEncode()?

answered Oct 1, 2008 at 13:46

Kev


5 Comments

Joel Coehoorn Over a year ago

This is in a library that will be used for both asp.net apps and batch processing (desktop).

ine Over a year ago

You can actually access Server.HtmlEncode() in a desktop app - all you have to do is add a reference to System.Web

Dmitry Dzygin Over a year ago

Neither Server.HtmlEncode() nor HttpUtility.HtmlAttributeEncode() replace characters like '\0'

stuartdotnet Over a year ago

Just noting for anyone thinking this is a good idea, System.Web is a big overhead and not really meant for class libraries/windows apps

Kev Over a year ago

@stuartdotnet - hence the caveat "If this is an ASP.NET app".

Dscoduc · Accepted Answer · 2009-01-07 20:30:20Z

This might be the case where you could benefit from using the WriteCData method.

    public override void WriteCData(string text)
        Member of System.Xml.XmlTextWriter

    Summary:
    Writes out a <![CDATA[...]]> block containing the specified text.

    Parameters:
    text: Text to place inside the CDATA block.

A simple example would look like the following:

    writer.WriteStartElement("name");
    writer.WriteCData("<unsafe characters>");
    writer.WriteFullEndElement();

The result looks like:

<name><![CDATA[<unsafe characters>]]></name>

When reading the node values, the XmlReader automatically strips out the CDATA part of the inner text, so you don't have to worry about it. The only catch is that you have to store the data as the inner text of an XML node. In other words, you can't insert CDATA content into an attribute value.
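One more thing to watch for if a CDATA section is ever built by hand rather than through a writer: the text itself must not contain the terminator `]]>`. A common workaround is to split the terminator across two adjacent sections. The sketch below shows that trick with a hypothetical `CDataHelper.ToCData` helper; it is plain string manipulation, not a framework API, and it does nothing about characters that are illegal in XML.

```csharp
public static class CDataHelper
{
    // Hypothetical helper: wraps text in a CDATA section. Every "]]>" in the
    // input is split so that "]]" ends one section and ">" starts the next,
    // which keeps the terminator from appearing inside a single section.
    public static string ToCData(string text)
    {
        return "<![CDATA[" + text.Replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For instance, `CDataHelper.ToCData("a]]>b")` produces `<![CDATA[a]]]]><![CDATA[>b]]>`, which a parser reads back as the original `a]]>b`.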

Granger · Accepted Answer · 2018-03-19 16:32:58Z

If you're serious about handling all of the invalid characters (not just the few "html" ones), and you have access to System.Xml, here's the simplest way to do proper Xml encoding of value data:

    string theTextToEscape = "Something \x1d else \x1D <script>alert('123');</script>";
    var x = new XmlDocument();
    x.LoadXml("<r/>"); // simple, empty root element
    x.DocumentElement.InnerText = theTextToEscape; // put in raw string
    string escapedText = x.DocumentElement.InnerXml;
    // Returns: Something &#x1D; else &#x1D; &lt;script&gt;alert('123');&lt;/script&gt;
    // Repeat the last 2 lines to escape additional strings.

It's important to know that XmlConvert.EncodeName() is not appropriate, because that's for entity/tag names, not values. Using that would be like Url-encoding when you needed to Html-encode.
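To see why, here is a small sketch of what XmlConvert.EncodeName actually produces. The wrapper `NameDemo.EncodeAsName` is a made-up name for illustration; the framework call is real.

```csharp
using System.Xml;

public static class NameDemo
{
    // Thin wrapper, for illustration only. EncodeName turns characters that
    // are not allowed in an XML name into _xHHHH_ escape sequences.
    public static string EncodeAsName(string s)
    {
        return XmlConvert.EncodeName(s);
    }
}
```

`NameDemo.EncodeAsName("Order Details")` gives `Order_x0020_Details`, and `NameDemo.EncodeAsName("a<b")` gives `a_x003C_b`. That is useful for building element names, but it is clearly not the `&lt;` style of escaping that text values need.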

nepaluz · Accepted Answer · 2011-11-18 06:27:40Z

Brilliant! That's all I can say.

Here is a VB variant of the updated code (not in a class, just a function) that will clean up and also sanitize the xml

    Function cXML(ByVal _buf As String) As String
        Dim textOut As New StringBuilder
        Dim c As Char
        ' String.IsNullOrEmpty avoids the NullReferenceException that _buf.Trim raises when _buf is Nothing.
        If String.IsNullOrEmpty(_buf) Then Return String.Empty
        For i As Integer = 0 To _buf.Length - 1
            c = _buf(i)
            If Entities.ContainsKey(c) Then
                textOut.Append(Entities.Item(c))
            ElseIf (AscW(c) = &H9 OrElse AscW(c) = &HA OrElse AscW(c) = &HD) OrElse ((AscW(c) >= &H20) AndAlso (AscW(c) <= &HD7FF)) _
                OrElse ((AscW(c) >= &HE000) AndAlso (AscW(c) <= &HFFFD)) OrElse ((AscW(c) >= &H10000) AndAlso (AscW(c) <= &H10FFFF)) Then
                ' Note: a single Char never exceeds &HFFFF, so surrogate halves
                ' (characters above U+FFFF) fall through and are dropped.
                textOut.Append(c)
            End If
        Next
        Return textOut.ToString
    End Function

    Shared ReadOnly Entities As New Dictionary(Of Char, String)() From {{""""c, "&quot;"}, {"&"c, "&amp;"}, {"'"c, "&apos;"}, {"<"c, "&lt;"}, {">"c, "&gt;"}}

Cosmin · Accepted Answer · 2015-04-23 11:29:51Z

You can use the built-in class XAttribute, which handles the encoding automatically:

    using System.Collections.Generic;
    using System.Xml.Linq;

    XDocument doc = new XDocument();
    List<XAttribute> attributes = new List<XAttribute>();
    attributes.Add(new XAttribute("key1", "val1&val11"));
    attributes.Add(new XAttribute("key2", "val2"));

    XElement elem = new XElement("test", attributes.ToArray());
    doc.Add(elem);

    string xmlStr = doc.ToString();

Phillip · Accepted Answer · 2017-03-30 09:55:02Z

Here is a single line solution using the XElements. I use it in a very small tool. I don't need it a second time so I keep it this way. (Its dirdy doug)

StrVal = (<x a=<%= StrVal %>>END</x>).ToString().Replace("<x a=""", "").Replace(">END</x>", "")

Oh and it only works in VB not in C#
