Match href value with a regular expression

Question

My input is similar to this:

<a href="link">text</a> <a href="correctLink">See full summary</a>

From this string I want to get only correctLink (the link that has See full summary as its text).

I'm working with Python, and I tried:

re.compile( '<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE )

but the only string I get with findall() is link">text</a> <a href="correctLink.

Where is my mistake?

Martijn Pieters · Accepted Answer · 2013-03-13 13:07:11Z

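Your .*? is non-greedy, but the match still starts at the first <a href=", and . matches quotes and tags too, so the group keeps growing until the rest of the pattern fits. Limit your link pattern to non-quote characters: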

re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

giving:

>>> import re
>>> patt = re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
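
To see the difference concretely, here is a minimal sketch that runs both patterns on the input from the question. The variable names (loose, strict, and so on) are only illustrative. The lazy group in the original pattern may cross quotes and tags, while the character class in the fixed pattern cannot leave the attribute value.

```python
import re

html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
flags = re.DOTALL | re.IGNORECASE

# Original pattern: '.' may match '"', '<' and '>', so the group that starts
# at the first href keeps expanding until '">See full summary</a>' follows.
loose = re.compile(r'<a href="(.*?)">See full summary</a>', flags)

# Fixed pattern: the group cannot contain a double quote, so a match that
# starts at the first href fails and the search moves on to the second one.
strict = re.compile(r'<a href="([^"]+?)">See full summary</a>', flags)

loose_result = loose.findall(html)
strict_result = strict.findall(html)

print(loose_result)   # ['link">text</a> <a href="correctLink']
print(strict_result)  # ['correctLink']
```

The fixed pattern still assumes that href is the only attribute and that it is double-quoted. Markup that deviates from that is one more reason to prefer a parser.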

Better yet, use a proper HTML parser.

Using BeautifulSoup, finding that link would be as easy as:

soup.find('a', text='See full summary')['href']

for an exact text match:

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
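
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answer above uses. The class name LinkByTextFinder and its attributes are hypothetical, and the sketch assumes the <a> elements are not nested.

```python
from html.parser import HTMLParser


class LinkByTextFinder(HTMLParser):
    """Collect href values of <a> elements whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hrefs = []        # hrefs of all matching links, in document order
        self._in_a = False     # True while inside an open <a> element
        self._href = None      # href of the currently open <a>, if any
        self._chunks = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value).
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        if self._in_a:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            text = ''.join(self._chunks).strip()
            if text == self.wanted and self._href is not None:
                self.hrefs.append(self._href)
            self._in_a = False
            self._href = None


finder = LinkByTextFinder('See full summary')
finder.feed('<a href="link">text</a> <a href="correctLink">See full summary</a>')
finder.close()
print(finder.hrefs)  # ['correctLink']
```

Unlike the regular expression, this does not care about attribute order, quoting style, or extra whitespace inside the tag.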
