The response from a GET service that I am trying to parse with ElementTree, and whose content I don't control, contains a non-UTF8 special character:
    respXML = response.content.decode("utf-8")
    respRoot = ET.fromstring(respXML)

The second line throws
xml.etree.ElementTree.ParseError: reference to invalid character number: line 3591, column 39
How can I make sure that the XML gets parsed regardless of the character set, which I can later run a replacement against if I find illegal characters? For example, is there an encoding which includes everything? I understand I can do a search and replace of the input XML string but I would prefer to parse it first because my parsing converts it into a data structure which is more easily searchable.
The special character in question is  but I would like to be able to ingest any character. The whole tag is <literal>Alzheimers disease</literal>.
It's the Unicode character reference that's the problem, not the encoding. I'm not sure how to add external entities to ElementTree. References of the form &#xhhhh; (where h is a hexadecimal digit) or &#dddd; (where d is a decimal digit) decode to the Unicode character with that code point, and that one is for the "End of Medium" character (U+0019), which isn't valid in XML. The only thing I can think of is replacing it with ' before passing it to ET.
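
The replacement idea could be sketched roughly as follows. This is only a minimal sketch, not a definitive fix: the helper names (scrub_invalid_char_refs, _is_valid_xml_char) are made up for illustration, the sample string assumes the reference was written in the hexadecimal form &#x19;, and the apostrophe is just one possible replacement. It rewrites only those numeric character references whose code point falls outside the XML 1.0 character ranges and leaves valid ones alone.

```python
import re
import xml.etree.ElementTree as ET

# Matches numeric character references: hexadecimal (&#x19;) or decimal (&#25;).
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def _is_valid_xml_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_invalid_char_refs(xml_text, replacement="'"):
    """Replace character references that XML 1.0 forbids (hypothetical helper)."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep valid references untouched; swap out the forbidden ones.
        return match.group(0) if _is_valid_xml_char(cp) else replacement
    return _CHAR_REF.sub(fix, xml_text)


# Assumed sample: the original reference is taken to be &#x19; (End of Medium).
sample = "<root><literal>Alzheimer&#x19;s disease</literal></root>"
root = ET.fromstring(scrub_invalid_char_refs(sample))
print(root.find("literal").text)  # Alzheimer's disease
```

Two caveats: the regular expression also rewrites matching text inside CDATA sections and comments, where it would not actually be a reference, and raw control characters that appear literally in the input (rather than as references) are not handled here and would still make the parser fail.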