I have an html encoded xml payload where I'd like to use regular expressions to extract a certain piece of that xml and write that to a file. I know it's generally not good practice to use regex for xml but I think this a special use case.
Anyway here is a sample encoded xml:
<root> <parent> <test1> <another> <subelement> <value>hello</value> </subelement> </another> </test1> </parent> </root> I ultimately want my result to be:
<test1> <another> <subelement> <value>hello</value> </subelement> </another> </test1> Here is my implementation in python to extract out all the text between the <test1> and </test1> inclusively:
import html import re file_stream = open('/path/to/test.xmp', 'r') raw_data = file_stream.read() escaped_raw_data = html.unescape(raw_data) result = re.search(r"<test1[\s\S]*?<\/test1>", escaped_raw_data) However I get no matches for result, what am I doing wrong? How to accomplish my goal?
., use[\s\S]because.does not match newlinesresult = re.search(r"<test1[\s\S]*?<\/test1>", escaped_raw_data)still I get a result ofNone<test1[\s\S]*?<\/test1>result = re.search(r"<test1>.*<\/test1>", escaped_raw_data, re.DOTALL).