2

I have an XML file which contains the following strings:

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them such that when > and < are in:

 ' "> ' ' " > ' and ' </ ' 

respectively, they should NOT be replaced; all other occurrences of > and < should be replaced by the strings "greater than" and "less than". The result should look like this:

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do that with Python?

5
  • Better to provide the result, rather than describe it. Commented Nov 10, 2012 at 3:51
  • You could try parsing the entire file using the Python regexp module (docs.python.org/2/library/re.html). Are all the improper uses of < and > in your file numerical expressions? If so, this should be pretty easy: just replace "# > #", "#> #", and "# >#" with "# is greater than #", and "# < #", "#< #", and "# <#" with "# is less than #". Commented Nov 10, 2012 at 8:15
  • No, they are not all numerical. Basically the problem is I can not come up with a suitable regexp. Commented Nov 10, 2012 at 9:14
  • Your constraints would replace the '<'s at the beginning of each line, as they don't fall into any of the 3 cases you provided where they should not be substituted. It might be easier to list the cases in which they are substituted. Commented Nov 10, 2012 at 10:38
  • ok, so what you're saying is that these may not all be numerical comparisons, but they are all comparisons by value? I assume you wouldn't want to translate '>' and '<' to 'greater than' and 'less than' in cases of stream redirection Commented Nov 10, 2012 at 12:11

3 Answers

2

You could use lxml.etree.XMLParser with the recover=True option:

    import sys
    from lxml import etree

    invalid_xml = """
    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>
    """
    root = etree.fromstring("<root>%s</root>" % invalid_xml,
                            parser=etree.XMLParser(recover=True))
    root.getroottree().write(sys.stdout)

Output

    <root>
    <field name="id">abcdef</field>
    <field name="intro"> pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 35 and try to remove non xml compatible characters.</field>
    </root>

Note: > is left in the document as &gt; and < is removed completely (as an invalid character in XML text).

Regex-based solution

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

    import re
    from itertools import izip_longest  # Python 2; on Python 3 this is itertools.zip_longest
    from xml.sax.saxutils import escape  # '<' -> '&lt;'

    # assumptions:
    #   doc = *( start_tag / end_tag / text )
    #   start_tag = '<' name *attr [ '/' ] '>'
    #   end_tag = '<' '/' name '>'
    ws = r'[ \t\r\n]*'  # allow ws between any token
    name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
    attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
    start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
    end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
    tag = '{start_tag} | {end_tag}'

    assert '{{' not in tag
    while '{' in tag:  # unwrap definitions
        tag = tag.format(**vars())

    tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

    # escape &, <, > in the text
    iters = [iter(tag_regex.split(invalid_xml))] * 2
    pairs = izip_longest(*iters, fillvalue='')  # iterate 2 items at a time
    print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of '{ws}' above.

Output

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and try to remove non xml compatible characters.</field>

Note: both < and > are escaped in the text.
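As a small illustration of why escaping is enough to make the text parseable, the sketch below escapes a sample string with xml.sax.saxutils.escape and feeds it to the standard-library ElementTree parser. The sample string and the variable names are made up for this example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Sample text (an assumption for this sketch) with raw '<' and '>' in it.
text = "5>2 and 3<5"

# escape() turns '&', '<' and '>' into their XML entities.
escaped = escape(text)
print(escaped)  # 5&gt;2 and 3&lt;5

# Once escaped, the text can sit inside an element and be parsed normally;
# the parser hands back the original characters.
element = ET.fromstring('<field name="desc">%s</field>' % escaped)
print(element.text)  # 5>2 and 3<5
```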

You could call any function instead of escape(text) above, e.g.:

    def escape4human(text):
        return text.replace('<', 'less than').replace('>', 'greater than')
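For readers on Python 3, here is a minimal self-contained sketch of the same split-then-substitute idea. It is not the code above verbatim: the tag pattern is written out directly with \s instead of being assembled from templates, and the names TAG_REGEX, spell_out and replace_in_text are hypothetical helpers introduced only for this example. spell_out also pads the words with spaces so that 5>2 reads "5 greater than 2", which is an assumption about the desired output.

```python
import re

# Hypothetical, simplified tag pattern for this sketch: alphabetic tag and
# attribute names, double-quoted attribute values only.
TAG_REGEX = re.compile(r'''
    (                                   # capture tags so split() keeps them
        < \s* [a-zA-Z]+ \s*
        (?: [a-zA-Z]+ \s* = \s* "[^"]*" \s* )*
        /? \s* >                        # start tag, e.g. <field name="id">
      |
        < \s* / \s* [a-zA-Z]+ \s* >     # end tag, e.g. </field>
    )
''', re.VERBOSE)


def spell_out(text):
    """Hypothetical variant of escape4human that pads the words with spaces."""
    return text.replace('<', ' less than ').replace('>', ' greater than ')


def replace_in_text(xml_fragment, convert=spell_out):
    """Apply `convert` to the text between tags and leave the tags untouched."""
    parts = TAG_REGEX.split(xml_fragment)
    # With one capturing group, split() alternates: text, tag, text, tag, ...
    return ''.join(convert(part) if index % 2 == 0 else part
                   for index, part in enumerate(parts))


sample = '<field name="desc"> We will show 5>2 and 3<5.</field>'
print(replace_in_text(sample))
# <field name="desc"> We will show 5 greater than 2 and 3 less than 5.</field>
```

Text that already has spaces around the operator, such as "5 > 2", would end up with doubled spaces; collapsing them is left out to keep the sketch short.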

2 Comments

lxml-based solution: 3<5 becomes 35
@adray: yes. < is invalid in xml text so the xml parser can't parse it properly and recover=True option allows the parser to skip it.
2

Seems I did it for >:

    re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

(?<!...) is a negative lookbehind assertion,

(?!...) is a negative lookahead assertion,

and writing two assertions side by side, (...)(...), is a logical AND,

so the whole expression is meant to read: substitute every occurrence of '>' that is not preceded by ' " ', is not preceded by ' "', and is not followed by ' '.

The case for < is similar.
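A rough sketch of how both cases might be handled with lookarounds follows. It departs from the expression above in labeled ways. The lookbehinds test for '"' and '" ' directly before '>'. End tags are matched as a whole and passed through, because Python's re only accepts fixed-width lookbehinds, so a tag name of arbitrary length cannot be excluded that way. The function name replace_stray_brackets and the rule "a '<' followed by a letter or '/' opens a tag" are assumptions made for this example.

```python
import re


def replace_stray_brackets(xml_string):
    """Hypothetical helper: spell out '<' and '>' that are not part of a tag."""
    # '<' is kept only when it looks like the start of a tag: '<name' or '</name'.
    result = re.sub(r'<(?!/?[a-zA-Z])', ' less than ', xml_string)

    # '>' is kept when it directly follows '"' or '" ' (end of a start tag
    # with attributes). End tags such as '</field>' are captured in group 1
    # and returned unchanged. Assumption: every start tag has at least one
    # attribute; the '>' of a bare '<root>' would be replaced.
    return re.sub(r'(</[a-zA-Z]+>)|(?<!")(?<!" )>',
                  lambda match: match.group(1) or ' greater than ',
                  result)


sample = '<field name="desc"> We will show 5>2 and 3<5.</field>'
print(replace_stray_brackets(sample))
# <field name="desc"> We will show 5 greater than 2 and 3 less than 5.</field>
```

Text such as a<b would be taken for the start of a tag and left alone, which is the main limitation of deciding from the neighbouring characters only.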

Comments

-3

Use ElementTree for XML parsing.

3 Comments

The ElementTree throws an exception because the XML has supposedly misplaced > and < characters.
@Mr_Spock: the point is that the XML is malformed, so ElementTree won't handle it. I just tried the lxml.html variant which can handle some malformed XML as well, but it too fails here.
I wonder why I received a downvote then. The guy's question wasn't even clear then. He didn't state what he was using until AFTER I brought up ElementTree. Seems a little unfair. I won't fret though.
