Why does this regex not work: r'.*logo.*' [duplicate]

Question

I expect the following regular expression to match, but it does not. Why?

    import re

    html = '''
    <a href="#">
    <img src="logo.png" alt="logo" width="100%">
    </img>
    </a>
    '''

    m = re.match(r'.*logo.*', html, re.M|re.I)
    if m:
        print m.group(1)
    if not m:
        print "not found"

See also stackoverflow.com/a/1732454/14122

– Charles Duffy, Commented Feb 10, 2014 at 22:27

Adam Smith · Accepted Answer · 2014-02-10 22:27:14Z

12

We don't use regex to parse HTML.

REPEAT AFTER ME: WE DON'T USE REGEX TO PARSE HTML.

That said, it doesn't work because re.match only matches at the beginning of the string, not at the beginning of each line, even with re.M. Use re.search or re.findall instead.
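A minimal sketch of the difference, assuming the question's HTML had its tags on separate lines (the variable names here are illustrative, not from the original post):

```python
import re

html = '''
<a href="#">
<img src="logo.png" alt="logo" width="100%">
</img>
</a>
'''

pattern = r'.*logo.*'

# re.match only tries position 0. The string starts with '\n', which '.'
# cannot match without re.DOTALL, so '.*' matches nothing and 'logo' fails.
matched = re.match(pattern, html, re.M | re.I)

# re.search tries every position, so it finds the line containing 'logo'.
searched = re.search(pattern, html, re.M | re.I)

print(matched)            # None
print(searched.group(0))  # <img src="logo.png" alt="logo" width="100%">
```

Note that m.group(1) in the question would raise an IndexError even on a successful match, because the pattern has no capture group; group(0) returns the whole match.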


9 Comments

Mr. Polywhirl Over a year ago

Recommends Beautiful Soup.

SethMMorton Over a year ago

This answer is better than mine because it gets to the root of the problem.

Adam Smith Over a year ago

(quickest way to get reputation on SO? Telling someone not to use regex to parse HTML.)

Charles Duffy Over a year ago

@Mr.Polywhirl, Beautiful Soup is just a wrapper around lxml.html these days; why not use the real, underlying library (which is arguably a bit better-designed) directly?

Charles Duffy Over a year ago

Ahh, gotcha -- re.DOTALL would be necessary for the leading .* to match newlines. Now that makes sense.


SethMMorton · Accepted Answer · 2014-02-10 22:27:08Z

1

Use re.search. re.match assumes the match is at the beginning of the string.


4 Comments

Charles Duffy Over a year ago

...well, arguably, the .* should allow this to match anyhow, with re.MULTILINE in use.

SethMMorton Over a year ago

Ok, so if that's not the problem, what is?

Charles Duffy Over a year ago

That's a good question, and if I knew (or, well, had time/inclination to reproduce), I'd be posting an answer myself. :)

Charles Duffy Over a year ago

Answered -- would need re.DOTALL in addition to re.MULTILINE for the leading .* to match past a newline.

Hugh Bothwell · Accepted Answer · 2014-02-10 23:04:52Z

You needed to include the re.DOTALL (== re.S) flag to allow the . to match newline (\n).

However, that returns the entire document if "logo" appears anywhere in it; not terribly useful.
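To make that concrete, here is a small sketch using the question's pattern with re.S added. It assumes the HTML string from the question, laid out one tag per line:

```python
import re

html = '''
<a href="#">
<img src="logo.png" alt="logo" width="100%">
</img>
</a>
'''

# With re.S (re.DOTALL), '.' also matches '\n', so the leading '.*' can
# run past the first newline and re.match succeeds from position 0.
m = re.match(r'.*logo.*', html, re.I | re.S)

# The trailing '.*' is greedy too, so the match is the entire string.
whole_document_matched = m is not None and m.group(0) == html
print(whole_document_matched)  # True
```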

Slightly better is

    import re

    html = """
    <a href="#">
    <img src="logo.png" alt="logo" width="100%" />
    </a>
    """

    match_logo = re.compile(r'<[^<]*logo[^>]*>', flags=re.I | re.S)

    for found in match_logo.findall(html):
        print(found)

which returns

<img src="logo.png" alt="logo" width="100%" />

Better yet would be

    from bs4 import BeautifulSoup

    # html is the string defined in the previous snippet
    pg = BeautifulSoup(html, "html.parser")
    print(pg.find("img", {"alt": "logo"}))
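If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser. This is only an illustration of using a parser rather than a regex; the class name LogoImageFinder and its matches attribute are made up for this example and are not part of any library:

```python
from html.parser import HTMLParser


class LogoImageFinder(HTMLParser):
    """Collects the attributes of every <img> whose alt text is "logo"."""

    def __init__(self):
        super().__init__()
        self.matches = []  # hypothetical attribute, one dict per matching tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> are routed here by default.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("alt") == "logo":
            self.matches.append(attributes)


html = """
<a href="#">
<img src="logo.png" alt="logo" width="100%" />
</a>
"""

finder = LogoImageFinder()
finder.feed(html)
finder.close()
print(finder.matches)
```

Unlike the regex, this only looks at real img tags and their alt attribute, so the word "logo" appearing in text or in another attribute does not produce a false hit.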
