Xml Python - not well-formed

Question

I have this code:

import requests
from xml.dom.minidom import parseString
site = 'test.com'
r = requests.get('http://bar-navig.yandex.ru/u?ver=2&url=http://%s&show=1' % (site))
#print r.text.encode('utf-8')
xmldoc = parseString(r.text.encode('utf-8'))
print xmldoc.getElementsByTagName('tcy')[0].attributes['value'].value

This works, but if I set site to, for example, 'vk.com' or 'google.ru', I get an error: xml.parsers.expat.ExpatError: not well-formed (invalid token).

How can I fix it? Thanks.

Alfe · Accepted Answer · 2014-01-29 08:20:44Z

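It's an encoding issue. An XML parser assumes UTF-8 (or UTF-16) unless the document declares otherwise. This XML source, in particular, declares that it is encoded as windows-1251, so encoding r.text as UTF-8 hands the parser bytes that contradict the declaration.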

Try this:

parseString(r.text.encode('windows-1251'))

Then it can be parsed.

minidom isn't very clever here; otherwise it would be able to figure that out by itself when passed a unicode string (which doesn't work).
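To illustrate the point about matching the bytes to the declared encoding, here is a minimal offline sketch (written for Python 3, unlike the Python 2 code in the question). The XML payload is a made-up stand-in, since the real response is not shown in the question; only the tcy element and its value attribute come from the original code, and the topic element is purely hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for the server's response (the real payload is not
# shown in the question). It declares windows-1251 and contains Cyrillic text.
xml_text = (
    '<?xml version="1.0" encoding="windows-1251"?>'
    '<urlinfo><tcy value="1000"/><topic title="Тема"/></urlinfo>'
)

# Encode with the SAME encoding the XML declaration names, so the bytes
# the parser receives agree with what the document claims about itself.
raw_bytes = xml_text.encode('windows-1251')

doc = parseString(raw_bytes)
tcy_value = doc.getElementsByTagName('tcy')[0].attributes['value'].value
topic_title = doc.getElementsByTagName('topic')[0].getAttribute('title')

print(tcy_value)
print(topic_title)
```

With requests, the undecoded response body is available as r.content, so passing that to parseString may avoid the decode/re-encode round trip altogether; whether it does depends on the server actually sending bytes that match its own declaration.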

manpur · Accepted Answer · 2016-06-15 07:15:11Z

I tried the encodings 'utf-8' and 'utf-16', as well as iso-8859-1, and none of them worked (for some Indian sites, though I did not notice any non-ASCII characters on them). I switched to Selenium and that solved it. Avoiding minidom is not difficult either, as Selenium has an interface quite similar to minidom. Cheers!
