You could use lxml.etree.XMLParser with recover=True option:
import sys from lxml import etree invalid_xml = """ <field name="id">abcdef</field> <field name="intro" > pqrst</field> <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field> """ root = etree.fromstring("<root>%s</root>" % invalid_xml, parser=etree.XMLParser(recover=True)) root.getroottree().write(sys.stdout)
Output
<root> <field name="id">abcdef</field> <field name="intro"> pqrst</field> <field name="desc"> this is a test file. We will show 5>2 and 35 and try to remove non xml compatible characters.</field> </root>
Note: > is left in the document as > and < is completely removed (as invalid character in xml text).
Regex-based solution
For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:
import re from itertools import izip_longest from xml.sax.saxutils import escape # '<' -> '<' # assumptions: # doc = *( start_tag / end_tag / text ) # start_tag = '<' name *attr [ '/' ] '>' # end_tag = '<' '/' name '>' ws = r'[ \t\r\n]*' # allow ws between any token name = '[a-zA-Z]+' # note: expand if necessary but the stricter the better attr = '{name} {ws} = {ws} "[^"]*"' # note: fragile against missing '"'; no "'" start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >' end_tag = '{ws}'.join(['<', '/', '{name}', '>']) tag = '{start_tag} | {end_tag}' assert '{{' not in tag while '{' in tag: # unwrap definitions tag = tag.format(**vars()) tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE) # escape &, <, > in the text iters = [iter(tag_regex.split(invalid_xml))] * 2 pairs = izip_longest(*iters, fillvalue='') # iterate 2 items at a time print(''.join(escape(text) + tag for text, tag in pairs))
To avoid false positives for tags you could remove some of '{ws}' above.
Output
<field name="id">abcdef</field> <field name="intro" > pqrst</field> <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>
Note: both <> are escaped in the text.
You could call any function instead of escape(text) above e.g.,
def escape4human(text): return text.replace('<', 'less than').replace('>', 'greater than')