I'm currently testing some substitutions in HTML files with the following code:

text = re.sub(u'<div class="paragraph" style="[^"]+"><span class="font61"><i>Test. </i>55<span class="font16"></span><span style=" letter-spacing:-0.70pt;"> </span></span></div>', u'<div class="paragraph" style="\1"><span class="font61"><i>Test.</i><span class="font16"></span><span style=" letter-spacing:-0.70pt;">55</span></span></div>', text) 

Unfortunately, my output is:

 <div class="paragraph" style=""><span class="font61"><i>Test. </i><span class="font16"></span><span style=" letter-spacing:-0.70pt;">55</span></span></div> 

Instead of style=" padding:6.00pt 63.36pt 0.00pt 43.68pt; text-align:justify;", I get a special character inside the style attribute, which cannot be displayed here. How can I fix this problem?

In other words: If I have something like:

<div class="paragraph" style=" padding:0.00pt 0.00pt 0.00pt 90.24pt; text-align:left;"><span class="font61"><i>Test </i>55<span class="font16"></span><span style=" letter-spacing:-0.70pt;"> </span></span></div> 

(The important part is the sequence </i> + number + <span class=.) I'd like to move the number into the last gap (here, inside the last inner <span>, just before its closing </span>). How can I do this?

1 Answer

re.sub() is doing what it has been told.

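Assuming that you have entered the replacement string correctly in your question, the "special character" is "\x01", and it comes from your replacement string (the 2nd argument to re.sub()). In an ordinary (non-raw) string literal, Python reads \1 as an octal escape, so the string that actually reaches re.sub() is: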

u'<div class="paragraph" style="\x01">.........' 
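To see the difference between the two spellings, a quick check may help. This is only a small sketch, assuming Python 3; the variable names are illustrative and not from the question.

```python
# In a normal string literal, "\1" is an octal escape for the character U+0001.
plain = '\1'
# In a raw string literal the backslash is kept, so re.sub() would see
# a backslash followed by "1", i.e. a reference to group 1.
raw = r'\1'

print(len(plain), repr(plain))  # 1 '\x01'
print(len(raw), repr(raw))      # 2 '\\1'
```

The one-character version is what ended up inside the style attribute in the output above.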

Try changing your replacement string to:

u'<div class="paragraph" style="padding:6.00pt 63.36pt 0.00pt 43.68pt; text-align:justify;"><span class="font61"><i>Test.</i><span class="font16"></span><span style=" letter-spacing:-0.70pt;">55</span></span></div>' 

However, you are probably better off using a library like BeautifulSoup to help you parse and process the HTML, rather than using regular expressions.

1 Comment

@MarkF6 this answer is correct, except that the character sequence you typed isn't \x01, it's \1. Use \\1 or a raw string literal, ur'\1' (r'\1' in Python 3, where the ur prefix is not valid).
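Note that a raw string alone is probably not enough here: the pattern in the question has no capture group, so \1 would have nothing to refer to. Below is a minimal sketch of one way to do the whole substitution, assuming Python 3 and the sample markup from the question. The group layout and the names html, pattern and fixed are my own choices, and the text inside <i>...</i> is left unchanged.

```python
import re

# Sample input taken from the question.
html = ('<div class="paragraph" style=" padding:0.00pt 0.00pt 0.00pt 90.24pt; text-align:left;">'
        '<span class="font61"><i>Test </i>55<span class="font16"></span>'
        '<span style=" letter-spacing:-0.70pt;"> </span></span></div>')

# Group 1: everything up to and including </i> (the style value is kept as-is).
# Group 2: the number directly after </i>.
# Group 3: the following spans, up to the opening tag of the last one.
# The single space inside the last span is matched but not captured, so it is dropped.
pattern = re.compile(
    r'(<div class="paragraph" style="[^"]+"><span class="font61"><i>[^<]*</i>)'
    r'(\d+)'
    r'(<span class="font16"></span><span style="[^"]*">) '
    r'(?=</span></span></div>)'
)

# Raw replacement string: \1, \3 and \2 are group references, not octal escapes.
fixed = pattern.sub(r'\1\3\2', html)
print(fixed)
```

Because the style value is captured rather than hard-coded, the same call should work for paragraphs with different padding, as long as the rest of the markup has exactly this shape.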
