
I am using the Python module lxml to parse XML files. However, some of the files contain invalid characters such as ®, which causes the following error:

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !

Bytes: 0xAE 0x0A 0x53 0x6F, line 45, column 91

Removing the character solves the problem.

I cannot ask the data provider to supply XML without such characters. To avoid posting a duplicate, I tried the following solution from Stack Overflow, but it gave me the same error:

parsed_doc = etree.parse(u, etree.XMLParser(encoding='utf-8', ns_clean=True, recover=True)) 

How do I ignore/escape such characters?

  • Looks like your data is actually encoded in ISO-8859-1. Why not try specifying that as the encoding instead? Commented Jun 6, 2016 at 14:39
  • Thanks, I will try that and see if it solves the issue. The XML declaration at the top of the file gives UTF-8 as its encoding: <?xml version="1.0" encoding="UTF-8"?>. Does that mean the data provider made a mistake? Commented Jun 6, 2016 at 15:00
  • 0xAE 0x0A 0x53 0x6F means "®\nSo" in Latin-1. Does the XML document use only Latin-1, or does it mix Latin-1 and UTF-8? In either case you should at least tell the provider, even if you solve it on your side. Commented Jun 6, 2016 at 15:07
  • It mixes Latin-1 and UTF-8. I will tell my provider about the issue. Thanks. Commented Jun 6, 2016 at 15:18
  • Strictly speaking, your XML file is still not well-formed. My view is that an XML parser should detect the encoding automatically from the XML file; specifying the encoding manually violates the XML spec. Commented Jun 6, 2016 at 22:16
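
The byte analysis in the comments above can be checked with the standard library alone. This is a minimal sketch; the variable names are illustrative, and only the four bytes come from the error message.

```python
# The four bytes reported in the XMLSyntaxError message.
bad = bytes([0xAE, 0x0A, 0x53, 0x6F])

try:
    bad.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xAE is a continuation byte, so it cannot start a UTF-8 sequence.
    valid_utf8 = False

# Latin-1 maps every byte to a code point, so this decode always succeeds.
as_latin1 = bad.decode("iso-8859-1")

# For comparison: the bytes a genuinely UTF-8 file would use for the same character.
registered_utf8 = "®".encode("utf-8")
```

Here `valid_utf8` ends up `False`, while the Latin-1 decode yields the registered sign followed by a newline and "So", which matches the reading given in the comment.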

1 Answer


As mentioned by @jwodder, the XML file was not actually encoded in UTF-8, even though its declaration specified UTF-8. I changed the encoding parameter of the lxml parser to ISO-8859-1:

parsed_doc = etree.parse(u, etree.XMLParser(encoding='ISO-8859-1', ns_clean=True, recover=True)) 

It worked perfectly.
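
One caveat, since the comments say the file mixes Latin-1 and UTF-8: forcing ISO-8859-1 on the whole file would turn any genuine UTF-8 multi-byte characters into mojibake. A possible alternative is to normalise the bytes before parsing. The sketch below is only an assumption about how that could be done; `to_utf8` and the sample data are hypothetical, and it assumes each individual line uses a single encoding.

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode mixed UTF-8/Latin-1 bytes as clean UTF-8, line by line."""
    fixed = []
    for line in raw.split(b"\n"):
        try:
            text = line.decode("utf-8")        # line is already valid UTF-8
        except UnicodeDecodeError:
            text = line.decode("iso-8859-1")   # fall back: treat the line as Latin-1
        fixed.append(text.encode("utf-8"))
    return b"\n".join(fixed)


# Hypothetical sample: one Latin-1 line (0xAE) and one UTF-8 line (0xC3 0xA9).
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<root><a>Acme\xae</a>\n"
    b"<b>caf\xc3\xa9</b></root>"
)
clean = to_utf8(sample)
```

The resulting `clean` bytes agree with the UTF-8 declaration, so they can be handed to `etree.fromstring(clean)` without overriding the encoding. A line that itself mixes both encodings would still be decoded entirely as Latin-1, so this is not a complete fix.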
