0

I expect the following regular expression to match, but it does not. Why?

    import re

    html = '''
    <a href="#">
    <img src="logo.png" alt="logo" width="100%">
    </img>
    </a>
    '''

    m = re.match(r'.*logo.*', html, re.M|re.I)
    if m:
        print m.group(1)
    if not m:
        print "not found"

3 Answers

12

We don't use regex to parse HTML.

REPEAT AFTER ME: WE DON'T USE REGEX TO PARSE HTML.

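That said, it doesn't work because re.match only matches at the very beginning of the string (not of each line, even with re.MULTILINE), and without re.DOTALL the leading .* cannot cross a newline to reach "logo". Use re.search or re.findall instead.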
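To see the difference between the two calls, here is a minimal sketch. It assumes the question's string really does start with a newline and has one tag per line (the line breaks are an assumption, since the posted snippet does not show them unambiguously), and it uses Python 3 print syntax.

```python
import re

# Assumed layout: the string begins with "\n" and each tag sits on its own line.
html = '''
<a href="#">
<img src="logo.png" alt="logo" width="100%">
</a>
'''

pattern = r'.*logo.*'

# re.match only tries position 0. The first character is "\n", which "."
# does not match by default, so ".*" cannot reach "logo". re.M changes
# how ^ and $ behave, not where re.match starts.
print(re.match(pattern, html, re.M | re.I))   # None

# re.search scans forward and succeeds on the line that contains "logo".
m = re.search(pattern, html, re.I)
if m:
    print(m.group(0))   # group(0) is the whole match; the pattern has no group 1
else:
    print("not found")
```

Note that group(0) is used here because the pattern defines no capturing groups, so asking for group(1) would raise an IndexError once the match succeeds.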


9 Comments

Recommends Beautiful Soup.
This answer is better than mine because it gets to the root of the problem.
(quickest way to get reputation on SO? Telling someone not to use regex to parse HTML.)
@Mr.Polywhirl, Beautiful Soup is just a wrapper around lxml.html these days; why not use the real, underlying library (which is arguably a bit better-designed) directly?
Ahh, gotcha -- re.DOTALL would be necessary for the leading .* to match newlines. Now that makes sense.
1

Use re.search. re.match assumes the match is at the beginning of the string.

4 Comments

...well, arguably, the .* should allow this to match anyhow, with re.MULTILINE in use.
Ok, so if that's not the problem, what is?
That's a good question, and if I knew (or, well, had time/inclination to reproduce), I'd be posting an answer myself. :)
Answered -- would need re.DOTALL in addition to re.MULTILINE for the leading .* to match past a newline.
1

You needed to include the re.DOTALL (== re.S) flag to allow the . to match newline (\n).

However, that returns the entire document if "logo" appears anywhere in it; not terribly useful.
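As a quick check of both points, the sketch below runs the original pattern with re.DOTALL added. The multi-line layout of the string is an assumption (one tag per line, with a leading and trailing newline).

```python
import re

# Assumed layout: leading newline, one tag per line, trailing newline.
html = """
<a href="#">
<img src="logo.png" alt="logo" width="100%" />
</a>
"""

# With re.S (re.DOTALL), "." also matches "\n", so re.match now succeeds...
m = re.match(r'.*logo.*', html, re.I | re.S)
print(m is not None)        # True

# ...but the greedy ".*" on both sides swallows everything around "logo",
# so the match is the entire input string.
print(m.group(0) == html)   # True
```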

Slightly better is

    import re

    html = """
    <a href="#">
    <img src="logo.png" alt="logo" width="100%" />
    </a>
    """

    match_logo = re.compile(r'<[^<]*logo[^>]*>', flags=re.I | re.S)
    for found in match_logo.findall(html):
        print(found)

which returns

<img src="logo.png" alt="logo" width="100%" /> 

Better yet would be

    from bs4 import BeautifulSoup

    # Name the parser explicitly; html.parser ships with Python
    pg = BeautifulSoup(html, "html.parser")
    print(pg.find("img", {"alt": "logo"}))
