
I have the following text:

<clipPath id="p54dfe3d8fa">
<path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
</clipPath>
<clipPath id="p27c84a8b3c">
<rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>

I need to grab this portion out:

d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z " 

I need to replace this section with something else. I was able to grab the entirety of <clipPath ...><path d="[code i want]"/> but this doesn't help me because I can't override the id in the <clipPath> element.

Note that there are other <clipPath> elements that I do not want to touch. I only want to change <path> elements within <clipPath> elements.

I'm thinking that the answer has to do with matching everything from the start of a clipPath element up to the path's d attribute. Any help would be greatly appreciated.

I've been using http://pythex.org/ for help and have also seen odd behavior (having to do with multiline and spaces) that doesn't behave the same between that site and Python 3.x code.

Here are some of the things I've tried:

import re
reg = r'(<clipPath.* id=".*".*>)'
reg = re.compile(r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")')
reg = re.compile(r'((?<!<clipPath).* id=".*".*>\s*<path.*d="(.*\n)+")')
g = reg.search(text)
g
  • Can clipPaths be nested? Commented Jan 27, 2017 at 20:02
  • No, I don't think so. Commented Jan 27, 2017 at 20:05
  • See also stackoverflow.com/questions/15857818/python-svg-parser Commented Jan 27, 2017 at 20:06
  • Why are you doing this with regex? Commented Jan 27, 2017 at 20:06
  • Is this XML? Why not do this with xml.etree.ElementTree or lxml? Commented Jan 27, 2017 at 20:11

3 Answers

Regex is never the proper way of parsing XML.

Here's a simple standalone example which does it using lxml:

from lxml import etree
text = """<clipPath id="p54dfe3d8fa">
<path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
</clipPath>
<clipPath id="p27c84a8b3c">
<rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>"""
# Wrap the fragment in an arbitrary root element <X> so it parses as a single document
root = etree.XML("<X>" + text + "</X>")
# Find the first <path> element anywhere in the tree
p = root.find(".//path")
print(p.get("d"))

result:

M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z 
  • first, I create the root node. Since the snippet has several top-level nodes, I wrap them in an arbitrary root node
  • then I look for "path" anywhere in the tree
  • once found, I get the d attribute

Now I change the value of d and dump the tree:

p.set("d", "[new text]")
print(etree.tostring(root))

now the output is like:

... <path d="[new text]"/>\n ... 

Still quick and dirty: find() returns only the first path node, so it may not be robust when there are several, but it works with the snippet you provided (and I'm no xml expert, just fumbling).
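To address the "several path nodes" caveat, here is a minimal sketch that changes only path elements sitting directly inside a clipPath. It swaps in the standard library's xml.etree.ElementTree instead of lxml so it runs without extra installs. The sample text, including the extra top-level path, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <path> inside a <clipPath>, one outside it.
text = """<clipPath id="p54dfe3d8fa">
  <path d="M 112.176 307.8 L 491.616 307.8 z"/>
</clipPath>
<clipPath id="p27c84a8b3c">
  <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>
<path d="M 0 0 L 1 1"/>"""

# Wrap the fragment in an arbitrary root element so it parses as one document.
root = ET.fromstring("<X>" + text + "</X>")

# Select only <path> elements that are direct children of a <clipPath>.
for path in root.findall(".//clipPath/path"):
    path.set("d", "[new text]")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The path outside any clipPath keeps its original d value. In a full SVG file the elements usually live in the SVG namespace, so the tag names in the path expression would need to be namespace-qualified.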

BTW, another hacky/non-regex way of doing it: using multi-character split:

text.split(' d="')[1].split('"/>')[0] 

This takes the part after the d=" delimiter, then the part before the "/> delimiter. It preserves the multi-line formatting.
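If the goal is to replace the value rather than just read it, the same idea can be sketched with str.partition, which keeps the delimiters so the text can be glued back together. The sample string is made up, and like the split version this grabs the first d=" it finds, whether or not it sits inside a clipPath.

```python
# Hypothetical sample with a multi-line d attribute.
text = '''<clipPath id="p54dfe3d8fa">
  <path d="M 112.176 307.8
L 491.616 307.8 z
"/>
</clipPath>'''

# partition splits once and returns (before, separator, after).
head, sep, rest = text.partition(' d="')
d_value, end, tail = rest.partition('"/>')

# Reassemble the text with a new value between the two delimiters.
new_text = head + sep + "[new text]" + end + tail
print(new_text)
```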

7 Comments

Nice for advising a solution without regexes. +1.
never say never ;) it may be better practice to use lxml or similar, but the author also stated he wanted to learn regex
@Aaron I understand that, then the OP should practice on something else than a nested syntax language with several possible syntaxes, like for instance a text file with line-per-line data. If OP wanted to parse C or Java with regex, it would be foolish as well.
(Plus since I don't know zit about xml and I could find my way by trial and error in a few minutes, that could convince more xml newbies that it's not that hard to use, even if I personally hate xml)
In my personal experience, I sometimes run into poorly formed xml that requires some regex lovin'

TL;DR: r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'

let's break that down...

you started with: r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")' which enclosed your entire pattern inside a group, so the whole element would be captured in the match object. Let's take out those parentheses: r'<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+"'

next you seem to use .* quite often, which can be dangerous because it is blind and greedy. for the clipPath id, if you know the id is always alphanumeric, a better solution might be r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d="(.*\n)+"'

finally, let's look at what you actually want to capture. your example shows you want to capture the quotation marks, so let's get those inside our capture group: ...*d=("(.*\n)+"). This leaves us with a weird nested group situation though, so let's make the inner group non-capturing: ...*d=("(?:.*\n)+").

now we're capturing what you want, but we still have a problem... what if there are multiple elements that satisfy these criteria? the greedy matching of the + in ...*d=("(?:.*\n)+") will capture every line in between. What we can do here is make the + non-greedy by following it with a ?: ...*d=("(?:.*\n)+?").

put all these things together:

r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'
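As a quick check, here is a sketch that runs this pattern against a made-up sample. Note the pattern needs at least one newline inside the d value before the closing quote, so the sample spreads the attribute over several lines. The second pattern, replace_pattern, is my own variant (an assumption, not part of the answer above) that captures the prefix instead, so re.sub can swap out just the quoted value.

```python
import re

# Hypothetical sample: the d attribute spans several lines.
text = '''<clipPath id="p54dfe3d8fa">
  <path d="M 112.176 307.8 L 174.672 270
L 491.616 307.8 z
"/>
</clipPath>
<clipPath id="p27c84a8b3c">
  <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>'''

# The pattern from the answer: group 1 is the quoted d value.
pattern = re.compile(r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")')
match = pattern.search(text)
captured = match.group(1)

# Variant for replacing: group 1 is everything up to d=, and the quoted
# value after it is matched with [^"]*, which also spans newlines.
replace_pattern = re.compile(r'(<clipPath[^>]*>\s*<path[^>]*?\bd=)"[^"]*"')
new_text = replace_pattern.sub(lambda m: m.group(1) + '"[new text]"', text)
print(new_text)
```

The second clipPath is left alone because its first child is a rect, not a path.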

1 Comment

This pretty much directly answers my question using a regex and that's awesome! I believe that one of the SVG/XML parsing libs is what I'm going to ultimately go with but this is going to get marked as correct because of your effort and explanation. Thanks!

An xml based solution that edits the path.

import xml.dom.minidom
# my_xml is the string shown under "Data used" below.
# Wrap the fragment in a single root element <X> and parse it with minidom
DOMTree = xml.dom.minidom.parseString('<X>' + my_xml + '</X>')
collection = DOMTree.documentElement
# Replace the d attribute of every <path> nested inside a <clipPath>
for clip_path in collection.getElementsByTagName("clipPath"):
    paths = clip_path.getElementsByTagName('path')
    for path in paths:
        path.setAttribute('d', '[code i want]')
print(DOMTree.toxml())

Data used:

my_xml = """
<clipPath id="p54dfe3d8fa">
<path d="M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z "/>
</clipPath>
<clipPath id="p27c84a8b3c">
<rect height="302.4" width="446.4" x="72.0" y="43.2"/>
</clipPath>
"""

1 Comment

This is pretty much what I ended up doing. Thank you!
