I am parsing the response of a GET service with ElementTree. I don't control the content, and it contains what looks like a non-UTF-8 special character:

    respXML = response.content.decode("utf-8")
    respRoot = ET.fromstring(respXML)

The second line throws

xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39

How can I make sure that the XML gets parsed regardless of the character set, which I can later run a replacement against if I find illegal characters? For example, is there an encoding which includes everything? I understand I can do a search and replace of the input XML string but I would prefer to parse it first because my parsing converts it into a data structure which is more easily searchable.

The special character in question is &#25; but I would like to be able to ingest any character. The whole tag is <literal>Alzheimer&#25;s disease</literal>.

  • What is in line 3591? Commented Jan 31, 2017 at 19:41
  • I just edited the question, see the last sentence Commented Jan 31, 2017 at 19:45
  • It's not the encoding... it's the &#25; numeric character reference that's the problem. I'm not sure how to add external entities to ElementTree. Commented Jan 31, 2017 at 19:57
  • Entities of the form &#dddd; (where d is a decimal digit) decode to Unicode code points, and that one is the "End of Medium" control character (U+0019), which isn't valid in XML 1.0. The only thing I can think of is replacing it with &apos; before passing it to ET. Commented Jan 31, 2017 at 20:06
  • A good argument for scrubbing before it is inserted! This could be some sort of encoding mismatch, such as MBCS text being posted to a field that is assumed to be UTF-8. Commented Jan 31, 2017 at 20:12
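
To make the point in the comments concrete, here is a minimal sketch that reproduces the failure on the tag from the question alone. The helper name `try_parse` is hypothetical and only exists for this illustration; the sample string is the `<literal>` tag quoted above, not the full service response.

```python
import xml.etree.ElementTree as ET

# The offending tag from the question, isolated from the rest of the response.
sample = "<literal>Alzheimer&#25;s disease</literal>"


def try_parse(xml_text):
    """Return (root, None) on success or (None, error) if parsing fails."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        return None, err


root, err = try_parse(sample)

# &#25; refers to code point 25 (U+0019), a control character that
# XML 1.0 does not allow, even when written as a character reference.
print(repr(chr(25)))
print(err)           # the parser's message, including line and column
print(err.position)  # (line, column) tuple exposed by ParseError

# The same tag with &apos; in place of &#25; parses without complaint.
fixed_root, fixed_err = try_parse(sample.replace("&#25;", "&apos;"))
print(fixed_root.text)
```

The input is pure ASCII, so no choice of encoding changes the outcome: the parser rejects the reference itself, not the bytes.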

1 Answer

With a little help from @tdelaney, I was able to get past this hurdle by scrubbing the input XML as a string:

    import re

    respXML = response.content.decode("utf-8")
    # Remove decimal character references such as &#25; before parsing
    scrubbedXML = re.sub(r'&#[0-9]+;', '', respXML)
    respRoot = ET.fromstring(scrubbedXML)
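
If deleting references outright is too blunt, a narrower variant is possible: replace only the numeric character references whose code point falls outside the XML 1.0 `Char` range, and leave legal ones such as `&#233;` alone. The following is a sketch under assumptions, not a drop-in fix. The names `CHAR_REF`, `is_xml10_char` and `scrub_invalid_char_refs` are hypothetical, and the default replacement `&apos;` is just the substitution suggested in the comments; it happens to suit this particular tag and may be wrong for other data.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#25;) and hexadecimal (&#x19;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_invalid_char_refs(xml_text, replacement="&apos;"):
    """Replace only the character references that XML 1.0 forbids."""

    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep legal references untouched; swap out the illegal ones.
        return match.group(0) if is_xml10_char(cp) else replacement

    return CHAR_REF.sub(fix, xml_text)


sample = "<literal>Alzheimer&#25;s disease &#233;</literal>"
scrubbed = scrub_invalid_char_refs(sample)
root = ET.fromstring(scrubbed)
print(scrubbed)
print(root.text)
```

Because this works on the raw string, it would also rewrite matching text inside CDATA sections or comments, where `&#25;` is literal text rather than a reference. Passing `replacement=""` gives the delete-only behaviour.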

1 Comment

Eh? This isn't "getting past" the problem, this is deleting it. You are simply removing the special characters here.
