1

I have this code:

import requests
from xml.dom.minidom import parseString

site = 'test.com'
r = requests.get('http://bar-navig.yandex.ru/u?ver=2&url=http://%s&show=1' % (site))
#print r.text.encode('utf-8')
xmldoc = parseString(r.text.encode('utf-8'))
print xmldoc.getElementsByTagName('tcy')[0].attributes['value'].value

This works, but if site is set to, for example, 'vk.com' or 'google.ru', I get an error: xml.parsers.expat.ExpatError: not well-formed (invalid token).

How can I fix it? Thanks.


2 Answers

3

It's an encoding issue. XML is assumed to be UTF-8 (or UTF-16) unless its declaration says otherwise. This XML source, in particular, declares that it is encoded as windows-1251, so the bytes handed to the parser have to be in that encoding.

Try this:

parseString(r.text.encode('windows-1251')) 

Then it can be parsed.

minidom isn't very clever; otherwise it would be able to figure this out by itself when passed a unicode string (which doesn't work).
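To see the mismatch without calling the remote service, here is a minimal sketch written for Python 3. The DOC string and the read_value helper are made up for illustration; only the tcy element and its value attribute come from the question. It feeds the parser the same document twice: once as bytes that match the declared encoding, and once as UTF-8 bytes that contradict it.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for the service's response: the declaration says
# windows-1251 and the attribute holds Cyrillic text.
DOC = '<?xml version="1.0" encoding="windows-1251"?><tcy value="Индекс"/>'


def read_value(raw_bytes):
    """Parse raw XML bytes and return the tcy value, or None if parsing fails."""
    try:
        dom = parseString(raw_bytes)
    except ExpatError:
        return None
    return dom.getElementsByTagName('tcy')[0].attributes['value'].value


# Bytes that match the declared encoding parse cleanly.
matching = read_value(DOC.encode('windows-1251'))

# UTF-8 bytes contradict the declaration, so the parser either rejects
# them or decodes them into the wrong characters.
mismatched = read_value(DOC.encode('utf-8'))

# ascii() keeps the output printable on any console.
print(ascii(matching), ascii(mismatched))
```

With requests, passing r.content (the undecoded response bytes) instead of re-encoding r.text should have the same effect as the fix above, because the parser then reads the encoding from the declaration itself. That is a suggestion, not something tested against this particular service.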


0

I tried the 'utf-8' and 'utf-16' encodings as well as iso-8859-1, and none of them worked (this was for some Indian sites, although I did not notice any non-ASCII characters on them). I switched to Selenium and that solved everything. Avoiding minidom is not difficult either, since Selenium has an interface quite similar to minidom's. Cheers!
