
My input is similar to this:

<a href="link">text</a> <a href="correctLink">See full summary</a> 

From this string I want to get only correctLink (the link whose text is See full summary).

I'm working with Python, and I tried:

re.compile( '<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE ) 

but the only string I get with findall() is: link">text</a> <a href="correctLink

Where is my mistake?

1 Answer

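Your .*? is non-greedy, but the match still starts at the first <a href=" in the string, and . also matches quotes and angle brackets, so the group keeps growing across the first link until ">See full summary</a> can finally match. Limit your link pattern to non-quote characters: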

re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE) 

giving:

>>> import re
>>> patt = re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
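To see the difference side by side, here is a small sketch that runs both patterns on the sample input from the question. The variable names (lazy, bounded, lazy_result, bounded_result) are only for illustration, and the bounded pattern drops re.DOTALL and the lazy quantifier because neither changes the result once the group cannot cross a quote.

```python
import re

html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'

# Pattern from the question: the match begins at the first '<a href="' and the
# lazy group grows one character at a time until the rest of the pattern fits.
lazy = re.compile(r'<a href="(.*?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

# Pattern from the answer: the group cannot contain a double quote, so it can
# never extend past the end of a single href value.
bounded = re.compile(r'<a href="([^"]+)">See full summary</a>', re.IGNORECASE)

lazy_result = lazy.findall(html)
bounded_result = bounded.findall(html)

print(lazy_result)     # one long string spanning both links
print(bounded_result)  # only the href of the matching link
```

Non-greedy only controls how far a group extends once a match attempt has started; it does not make the engine prefer a later, shorter starting point.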

Better yet, use a proper HTML parser.

Using BeautifulSoup, finding that link would be as easy as:

soup.find('a', text='See full summary')['href'] 

for an exact text match:

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
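If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the BeautifulSoup approach shown above, and LinkTextFinder and find_hrefs are hypothetical names made up for this example. It assumes links are not nested and that the link text should match exactly after stripping surrounding whitespace.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect href values of <a> tags whose text equals a target string."""

    def __init__(self, target_text):
        super().__init__()
        self.target_text = target_text
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; order does not matter here
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._chunks).strip() == self.target_text:
                self.matches.append(self._href)
            self._href = None


def find_hrefs(html, target_text):
    """Return the hrefs of all links whose text is exactly target_text."""
    parser = LinkTextFinder(target_text)
    parser.feed(html)
    parser.close()
    return parser.matches


html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
hrefs = find_hrefs(html, "See full summary")
print(hrefs)
```

Unlike the regex, this does not depend on href being the first attribute or on the exact spacing inside the tag.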