How to solve encoding problem reading feed

Question

https://sports.ultraplay.net/sportsxml?clientKey=b4dde172-4e11-43e4-b290-abdeb0ffd711&sportId=1165

I'm trying to read this feed in a .NET environment and I get a BOM issue (System.Xml.XmlException: 'There is no Unicode byte order mark. Cannot switch to Unicode.'). How can I solve it? Is it because the XML content doesn't have an XML declaration tag?

I have tried reading the feed in every way I could think of. Here is one example:

XmlReader reader = XmlReader.Create(feedUrl);
var content = XDocument.Load(reader);

dana · Accepted Answer · 2018-11-02 16:32:13Z

The XML declaration seems to be throwing things off here:

<?xml version="1.0" encoding="utf-16"?>

See: Loading xml with encoding UTF 16 using XDocument

That question addresses the scenario where you read an XML file through a StreamReader. Since you are downloading the file from the web, you can adapt a WebClient to a StreamReader using the OpenRead() method as follows:

string feedUrl = "https://sports.ultraplay.net/sportsxml?clientKey=b4dde172-4e11-43e4-b290-abdeb0ffd711&sportId=1165";
System.Xml.Linq.XDocument content;
using (System.Net.WebClient webClient = new System.Net.WebClient())
using (System.IO.Stream stream = webClient.OpenRead(feedUrl))
// Decode the bytes as UTF-8; the XML parser then receives characters, not raw bytes.
using (System.IO.StreamReader streamReader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
{
    content = System.Xml.Linq.XDocument.Load(streamReader);
}
System.Console.WriteLine(content);
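To see why wrapping the stream in a StreamReader helps, here is a minimal offline sketch. It assumes the feed behaves like the hypothetical payload below: UTF-8 bytes, no BOM, and a declaration that claims utf-16. Loading the raw stream should fail with the XmlException from the question, because the parser tries to switch encodings. Loading through a StreamReader should succeed, because the parser is handed already-decoded text.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical payload standing in for the feed: declared as utf-16, actually sent as UTF-8.
string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><feed><item>ok</item></feed>";
byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(xml); // false = do not emit a BOM

// 1) Raw byte stream: the parser sees the utf-16 declaration but finds no UTF-16 BOM.
bool streamLoadFailed = false;
try
{
    using (var stream = new MemoryStream(utf8Bytes))
    {
        XDocument.Load(stream);
    }
}
catch (XmlException ex)
{
    streamLoadFailed = true;
    Console.WriteLine("Stream load failed: " + ex.Message);
}

// 2) StreamReader: the bytes are decoded as UTF-8 before the parser sees them.
XDocument viaReader;
using (var stream = new MemoryStream(utf8Bytes))
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    viaReader = XDocument.Load(reader);
}
Console.WriteLine("Reader load root element: " + viaReader.Root?.Name);
```

The sketch uses top-level statements, so it needs a recent C# compiler. On older toolchains the same statements can go inside a Main method.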

Strangely enough, while the document claims to be UTF-16, the HTTP response says UTF-8, which is why I am specifying that in the StreamReader constructor.

HTTP/1.1 200 OK
Date: Fri, 02 Nov 2018 16:28:46 GMT
Content-Type: application/xml; charset=utf-8

This seems to work well :)
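If you would rather not hard-code UTF-8, one option is to take the charset from the Content-Type header shown above. The helper below is only a sketch: EncodingFromContentType is a hypothetical name, not a framework method, and it assumes the server reports the charset truthfully. It falls back to a caller-supplied encoding when the header is missing, malformed, or names a charset the runtime does not know.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper (not part of the framework): maps a Content-Type header value
// to an Encoding, returning the fallback when no usable charset is present.
static Encoding EncodingFromContentType(string headerValue, Encoding fallback)
{
    if (string.IsNullOrWhiteSpace(headerValue))
    {
        return fallback;
    }
    try
    {
        // ContentType parses values such as "application/xml; charset=utf-8".
        var charset = new ContentType(headerValue).CharSet;
        return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
    }
    catch (FormatException) { return fallback; }   // header could not be parsed
    catch (ArgumentException) { return fallback; } // charset name not recognised
}

// Header value copied from the response shown above.
Encoding detected = EncodingFromContentType("application/xml; charset=utf-8", Encoding.UTF8);
Console.WriteLine(detected.WebName);
```

With WebClient, the header value should be available as webClient.ResponseHeaders["Content-Type"] once OpenRead() has returned, and the result can be passed to the StreamReader constructor in place of Encoding.UTF8.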

Well, it's UTF-8 encoded because you ask it to be that way. It doesn't mean that the original page encoding was UTF-8 (actually, it's UTF-16). WebClient uses the specified encoding to encode the result data bytes. It doesn't check whether it matches the response encoding. If that page was using a different, specific encoding, you'll get garbled text. Something related I posted: Kanji characters from WebClient html different from actual Kanji
If you don't specify an Encoding, the procedure checks the BOM of these: Encoding.UTF8, Encoding.UTF32, Encoding.Unicode, Encoding.BigEndianUnicode. Sure, web pages tend to be UTF-8 encoded. But many are not.
For what it's worth, I saw that header in Fiddler. I also tried WebClient with Encoding.Unicode, but it barfed. UTF-8 seemed to do the trick.
My comment is not specific to this question. It's good to know that the Encoding specified using the WebClient property does not guarantee that the downloaded data will be decoded correctly. Quite the opposite. It's probably better not to specify an Encoding unless one is sure what it is. This is reported in the answer I linked. Also, the underlying WebResponse is used to get the actual Encoding provided by the remote host. The byte[] data is then re-encoded using the correct Encoding. The current question is more related to the XmlReader behaviour with a file Encoding.
Anyway, as per the OP, this will probably solve the immediate problem (encoding in a Unicode form, to accommodate the XmlReader requirements, if that is the tool). If the text is not completely right, there's enough info here to understand why and fix it.
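The comments above mention BOM checks in the context of WebClient. StreamReader offers a comparable check of its own, sketched below under the assumption that the data actually starts with a BOM (the feed in the question apparently does not). The sample payload is hypothetical. The encoding passed to the constructor acts only as a fallback when no BOM is found.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical sample: a tiny document encoded as UTF-16 LE, prefixed with its BOM (FF FE).
string original = "<feed>ok</feed>";
byte[] payload = Encoding.Unicode.GetPreamble()
    .Concat(Encoding.Unicode.GetBytes(original))
    .ToArray();

string decoded;
string detectedName;
using (var stream = new MemoryStream(payload))
// UTF-8 is only the fallback here; 'true' asks StreamReader to honour a BOM if one is present.
using (var reader = new StreamReader(stream, Encoding.UTF8, true))
{
    decoded = reader.ReadToEnd();
    // CurrentEncoding reflects the detected encoding only after the first read.
    detectedName = reader.CurrentEncoding.WebName;
}

Console.WriteLine(detectedName + ": " + decoded);
```

When there is no BOM, as with the feed discussed here, the reader simply uses the fallback, so the fallback still has to be chosen correctly.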
