Matching patterns in Python

Question

I have an XML file which contains the following strings:

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them such that when > and < are in:

 ' "> ' ' " > ' and ' </ '

respectively, they should NOT be replaced; all other occurrences of > and < should be replaced by the strings "greater than" and "less than". So the result should look like:

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do that with Python?

You could try parsing the entire file with the Python regexp module (docs.python.org/2/library/re.html). Are all the improper uses of < and > in your file numerical expressions? If so, this should be pretty easy: just replace "# > #", "#> #", and "# >#" with "# is greater than #", and "# < #", "#< #", and "# <#" with "# is less than #".
– Hart Simha, Commented Nov 10, 2012 at 8:15
No, they are not all numerical. Basically, the problem is that I cannot come up with a suitable regexp.
– rivu, Commented Nov 10, 2012 at 9:14
Your constraints would replace the '<'s at the beginning of each line, as they don't fall into any of the 3 cases you listed where they should not be substituted. It might be easier to specify the cases in which they are substituted.
– Hart Simha, Commented Nov 10, 2012 at 10:38
OK, so what you're saying is that these may not all be numerical comparisons, but they are all comparisons by value? I assume you wouldn't want to translate '>' and '<' to 'greater than' and 'less than' in cases of stream redirection.
– Hart Simha, Commented Nov 10, 2012 at 12:11

jfs · Accepted Answer · 2012-11-11 23:36:27Z

You could use lxml.etree.XMLParser with the recover=True option:

    import sys
    from lxml import etree

    invalid_xml = """
    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>
    """
    root = etree.fromstring("<root>%s</root>" % invalid_xml,
                            parser=etree.XMLParser(recover=True))
    root.getroottree().write(sys.stdout)

Output

    <root>
    <field name="id">abcdef</field>
    <field name="intro"> pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 35 and try to remove non xml compatible characters.</field>
    </root>

Note: > is left in the document as &gt; and < is completely removed (as an invalid character in XML text).

Regex-based solution

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

    import re
    try:
        from itertools import zip_longest  # Python 3
    except ImportError:
        from itertools import izip_longest as zip_longest  # Python 2
    from xml.sax.saxutils import escape  # '<' -> '&lt;'

    # assumptions:
    #   doc = *( start_tag / end_tag / text )
    #   start_tag = '<' name *attr [ '/' ] '>'
    #   end_tag = '<' '/' name '>'
    ws = r'[ \t\r\n]*'  # allow ws between any token
    name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
    attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
    start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
    end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
    tag = '{start_tag} | {end_tag}'

    assert '{{' not in tag
    while '{' in tag:  # unwrap definitions
        tag = tag.format(**vars())

    tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

    # escape &, <, > in the text
    # split() with one capturing group yields text, tag, text, tag, ...
    iters = [iter(tag_regex.split(invalid_xml))] * 2
    pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
    print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of '{ws}' above.

Output

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and try to remove non xml compatible characters.</field>

Note: both < and > are escaped in the text.

You could call any function instead of escape(text) above e.g.,

    def escape4human(text):
        return text.replace('<', 'less than').replace('>', 'greater than')
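For readers on Python 3, here is a minimal self-contained sketch of the same split-then-substitute idea. It is not the code above. TAG, spell_out and fix_text are hypothetical names, the tag pattern is a simplified assumption (ASCII-letter names, double-quoted attribute values only), and spell_out pads the words with spaces so that "5>2" reads as "5 greater than 2".

```python
import re
from xml.sax.saxutils import escape

# Assumption: tags are <name attr="value" ...>, <name/> or </name>,
# with ASCII-letter names and double-quoted attribute values only.
TAG = re.compile(r'''(
    < \s* [a-zA-Z]+ \s* (?: [a-zA-Z]+ \s* = \s* "[^"]*" \s* )* /? \s* >
  | < \s* / \s* [a-zA-Z]+ \s* >
)''', re.VERBOSE)


def spell_out(text):
    # Hypothetical helper: like escape4human, but padded with spaces.
    return text.replace('<', ' less than ').replace('>', ' greater than ')


def fix_text(doc, replace=spell_out):
    # split() with one capturing group returns text, tag, text, tag, ...
    # so even indexes hold text and odd indexes hold the matched tags.
    parts = TAG.split(doc)
    return ''.join(part if i % 2 else replace(part)
                   for i, part in enumerate(parts))


sample = ('<field name="intro" > pqrst</field> '
          '<field name="desc"> We will show 5>2 and 3<5.</field>')
print(fix_text(sample))
print(fix_text(sample, replace=escape))
```

Text that happens to look like a tag, such as a<b>c, would still be treated as markup by this pattern, so the sketch only suits input where that cannot occur.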

@adray: yes. < is invalid in XML text, so the XML parser can't parse it properly; the recover=True option allows the parser to skip it.

adray · Accepted Answer · 2012-11-10 10:45:04Z

Seems I did it for >:

    re.sub('(?<! " )(?<! ")>(?! )', 'greater than', xml_string)

(?<!...) is a negative lookbehind assertion,

(?!...) is a negative lookahead assertion,

and consecutive assertions (...)(...) act as a logical AND,

so the whole expression means: substitute every occurrence of '>' that is not preceded by ' " ', is not preceded by ' "', and is not followed by ' '.

The case for < is similar.
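A small sketch of how such assertions behave, using short illustrative strings. The patterns gt and lt are assumptions chosen for the demonstration, not the expression from this answer, and the [gt]/[lt] markers are placeholders for the replacement words.

```python
import re

# '>' that is neither preceded by a double quote nor followed by a space.
gt = re.compile(r'(?<!")>(?! )')
# '<' that is not followed by '/', i.e. not the start of a closing tag.
lt = re.compile(r'<(?!/)')

print(gt.sub('[gt]', 'x">y x> y x>y'))  # x">y x> y x[gt]y
print(lt.sub('[lt]', '3<5 </field>'))   # 3[lt]5 </field>
```

Lookarounds only inspect the characters next to the match, so on their own they cannot tell the '<' that opens a tag such as <field from the one in a<b; that distinction needs a pattern for whole tags.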

Mr_Spock · Accepted Answer · 2012-11-10 04:03:22Z

-3

Use ElementTree for XML parsing.

3 Comments

rivu Over a year ago

The ElementTree throws an exception because the XML has supposedly misplaced > and < characters.

Fred Foo Over a year ago

@Mr_Spock: the point is that the XML is malformed, so ElementTree won't handle it. I just tried the lxml.html variant which can handle some malformed XML as well, but it too fails here.

Mr_Spock Over a year ago

I wonder why I received a downvote then. The guy's question wasn't even clear then. He didn't state what he was using until AFTER I brought up ElementTree. Seems a little unfair. I won't fret though.
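The ElementTree failure described in these comments can be reproduced with the standard library alone. A minimal sketch on two illustrative strings (not the asker's file), where parses is a hypothetical helper: a bare > in text is accepted, while a bare < raises ParseError.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    # Hypothetical helper: True if ElementTree accepts the string.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses('<field name="desc">5>2</field>'))  # True
print(parses('<field name="desc">3<5</field>'))  # False
```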
