Python Regex look behind

Question

I have the following text:

    <clipPath id="p54dfe3d8fa">
     <path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
    </clipPath>
    <clipPath id="p27c84a8b3c">
     <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
    </clipPath>

I need to grab this portion out:

d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "

I need to replace this section with something else. I was able to grab the entirety of <clipPath ...><path d="[code i want]"/> but this doesn't help me because I can't override the id in the <clipPath> element.

Note that there are other <clipPath> elements that I do not want to touch. I only want to change <path> elements within <clipPath> elements.

I'm thinking that the answer has to do with selecting everything before a clipPath element and ending at the Path section. Any help would be entirely appreciated.

I've been using http://pythex.org/ for help and have also seen odd behavior (to do with multiline and spaces) that differs between that site and Python 3.x code.

Here are some of the things I've tried:

    reg = r'(<clipPath.* id=".*".*>)'
    reg = re.compile(r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")')
    reg = re.compile(r'((?<!<clipPath).* id=".*".*>\s*<path.*d="(.*\n)+")')
    g = reg.search(text)
    g

See also stackoverflow.com/questions/15857818/python-svg-parser
– kennytm, Commented Jan 27, 2017 at 20:06
Is this XML? Why not do this with xml.etree.ElementTree or lxml?
– PYPL, Commented Jan 27, 2017 at 20:11

Jean-François Fabre · Answer · 2017-01-27 21:32:07Z

regex is never the proper way of parsing xml.

Here's a simple standalone example which does it using lxml:

    from lxml import etree

    text = """<clipPath id="p54dfe3d8fa">
     <path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
    </clipPath>
    <clipPath id="p27c84a8b3c">
     <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
    </clipPath>"""

    # Wrap the fragments in an arbitrary root node <X> so they parse as one document
    root = etree.XML("<X>" + text + "</X>")
    # Find the first <path> anywhere below the root
    p = root.find(".//path")
    print(p.get("d"))

result:

M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z

first, I create the main node. Since there are several nodes, I wrap it in an arbitrary main node
then I look for "path" anywhere
once found, I get the d attribute

Now I'm changing the text for d and dump it:

    p.set("d", "[new text]")
    print(etree.tostring(root))

now the output is like:

... <path d="[new text]"/>\n ...

still, quick and dirty, maybe not robust to several path nodes, but works with the snippet you provided (and I'm no xml expert, just fumbling)
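If there are several path nodes, one way to stay within the question's constraint (only <path> elements inside <clipPath>) is to loop over a narrower path expression. This is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; the shortened d values and the extra top-level <path> are made up for illustration, and a full SVG file that declares the SVG namespace would need namespace-qualified tag names instead.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the last <path> is a hypothetical element outside any <clipPath>.
text = """<clipPath id="p54dfe3d8fa">
 <path d="M 112.176 307.8 L 174.672 270 z "/>
</clipPath>
<clipPath id="p27c84a8b3c">
 <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>
<path d="M 0 0 L 1 1 z "/>"""

# Wrap the fragments in an arbitrary root node so they parse as one document.
root = ET.fromstring("<X>" + text + "</X>")

# "clipPath/path" selects only <path> elements that are direct children of a <clipPath>.
for p in root.findall(".//clipPath/path"):
    p.set("d", "[new text]")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The <path> outside the <clipPath> elements keeps its original d value, and so does everything in the second <clipPath>.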

BTW, another hacky/non-regex way of doing it: using multi-character split:

text.split(' d="')[1].split('"/>')[0]

taking the second part (after the d=" delimiter), then the first part (before the "/> delimiter). Preserves the multi-line formatting.
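Since the goal is to replace the value rather than just read it, the same idea can be sketched with str.partition, which keeps the text on both sides of each delimiter. The shortened multi-line sample is an assumption about how the file is laid out, and like the split version this only handles the first d=" it finds and does not check that it sits inside a <clipPath>.

```python
# Shortened sample with the d value spread over several lines (assumed layout).
text = '''<clipPath id="p54dfe3d8fa">
  <path d="M 112.176 307.8
L 174.672 270
z
"/>
</clipPath>
'''

# Split once at the first ' d="': text before it, the delimiter, and the rest.
head, sep, rest = text.partition(' d="')
# The attribute value runs up to the first '"/>' after that.
old_value, end, tail = rest.partition('"/>')

if sep and end:
    # Both delimiters were found: splice the new value between them.
    new_text = head + sep + "[new text]" + end + tail
else:
    # Leave the text alone if the delimiters are missing.
    new_text = text

print(new_text)
```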

never say never ;) it may be better practice to use lxml or similar, but the author also stated he wanted to learn regex
@Aaron I understand that, then the OP should practice on something else than a nested syntax language with several possible syntaxes, like for instance a text file with line-per-line data. If OP wanted to parse C or Java with regex, it would be foolish as well.
(Plus since I don't know zit about xml and I could find my way by trial and error in a few minutes, that could convince more xml newbies that it's not that hard to use, even if I personally hate xml)
In my personal experience, I sometimes run into poorly formed xml that requires some regex lovin'

Aaron · Accepted Answer · 2017-01-27 20:38:25Z

TL;DR: r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'

let's break that down...

you started with: r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")' which enclosed your entire capture pattern inside a group, so the whole element would be captured in the match object. Let's take out those parentheses: r'<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+"'

next you seem to use .* quite often, which can be dangerous because it is blind and greedy. for the clipPath id, if you know the id is always alphanumeric, a better solution might be r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d="(.*\n)+"'
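To see what "blind and greedy" means in practice, here is a small sketch on a made-up one-line element that has a second attribute after the id (the class attribute is hypothetical, not from your data):

```python
import re

# Hypothetical element with another quoted attribute after the id.
line = '<clipPath id="p54dfe3d8fa" class="demo">'

# .* runs to the last double quote on the line.
greedy = re.search(r'id=".*"', line).group()
# The character class stops at the first character that is not alphanumeric.
strict = re.search(r'id="[a-zA-Z0-9]+"', line).group()

print(greedy)  # id="p54dfe3d8fa" class="demo"
print(strict)  # id="p54dfe3d8fa"
```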

finally, let's look at what you actually want to capture. your example shows you want to capture the quotation marks, so let's get those inside our capture group: ...*d=("(.*\n)+"). This leaves us with a weird nested group situation though, so let's make the inner group non-capturing: ...*d=("(?:.*\n)+").

now we're capturing what you want, but we still have a problem... what if there are multiple elements that satisfy these criteria? the greedy matching of the + in ...*d=("(.*\n)+") will capture every line in-between. What we can do here is to make the + non-greedy by following it with a ?: ...*d=("(?:.*\n)+?").
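a quick sketch of the difference, using only the tail end of the pattern and two made-up path elements (the coordinates are placeholders, and I'm assuming the closing quote sits at the start of its own line):

```python
import re

# Two hypothetical <path> elements, each with a multi-line d value.
text = (
    '<path d="M 1 1\nL 2 2\n"/>\n'
    '<path d="M 3 3\nL 4 4\n"/>\n'
)

greedy = re.search(r'd=("(?:.*\n)+")', text).group(1)
lazy = re.search(r'd=("(?:.*\n)+?")', text).group(1)

# The greedy version runs on into the second element.
print(repr(greedy))
# The lazy version stops at the first closing quote.
print(repr(lazy))
```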

put all these things together:

r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'
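and since you want to replace that section, here is a rough sketch of one way to do it: use the span of the capture group to splice new text in. The shortened sample and the "[new text]" placeholder are made up, and I'm assuming the d value is spread over several lines with the closing quote at the start of its own line, because the pattern needs at least one newline inside the quotes.

```python
import re

pattern = re.compile(r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")')

# Shortened sample in the assumed multi-line layout.
text = '''<clipPath id="p54dfe3d8fa">
  <path d="M 112.176 307.8
L 112.176 307.8
z
"/>
</clipPath>
<clipPath id="p27c84a8b3c">
  <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>
'''

m = pattern.search(text)
if m:
    # Replace only what group 1 covers (the quoted d value); the id is left alone.
    new_text = text[:m.start(1)] + '"[new text]"' + text[m.end(1):]
else:
    new_text = text

print(new_text)
```

this only handles the first match; for several matching elements you would loop with pattern.finditer and rebuild the string piece by piece.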

This pretty much directly answers my question using a regex and that's awesome! I believe that one of the SVG/XML parsing libs is what I'm going to ultimately go with but this is going to get marked as correct because of your effort and explanation. Thanks!

Stephen Rauch · Answer · 2017-01-27 20:36:21Z

An xml based solution that edits the path.

    import xml.dom.minidom

    # Parse the XML with minidom, wrapped in an arbitrary root node <X>
    # (my_xml is the data shown below)
    DOMTree = xml.dom.minidom.parseString('<X>' + my_xml + '</X>')
    collection = DOMTree.documentElement

    # Only touch <path> elements found inside a <clipPath>
    for clip_path in collection.getElementsByTagName("clipPath"):
        paths = clip_path.getElementsByTagName('path')
        for path in paths:
            path.setAttribute('d', '[code i want]')

    print(DOMTree.toxml())

Data used:

    my_xml = """
    <clipPath id="p54dfe3d8fa">
     <path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
    </clipPath>
    <clipPath id="p27c84a8b3c">
     <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
    </clipPath>
    """
