python, regex to find anchor link html

Question

I need a regex in Python to find a link's HTML within a larger block of HTML.

So if I have:

<ul class="something"> <li id="li_id"> <a href="#" title="myurl">URL Text</a> </li> </ul>

I would get back:

<a href="#" title="myurl">URL Text</a>

I'd like to do it with a regex and not beautifulsoup or something similar to that. Does anyone have a snippet lying around I could use for this?

Thanks

"I'd like to do it with a regex and not beautifulsoup or something similar to that." Enjoy pounding that screw with a hammer.
– Ignacio Vazquez-Abrams, Commented Jan 21, 2010 at 2:57
Seriously: DON'T use regular expressions to parse HTML. Just don't. stackoverflow.com/questions/1732348/…
– Alex Martelli, Commented Jan 21, 2010 at 2:59
Why would you like to do it with a regex and not beautifulsoup or something similar to that?
– SLaks, Commented Jan 21, 2010 at 3:02

mechanical_meat · Accepted Answer · 2010-01-21 04:01:16Z

Soup is good for you:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>''')

There are many arguments you can pass to the findAll method; more here. The one line below will get you started by returning a list of all links matching some conditions.

>>> soup.findAll(href='#', title='myurl')
[<a href="#" title="myurl">URL Text</a>]

Edit: based on OP's comment, added info included:

So let's say you're interested only in a tags within list elements of a certain class, <li class="li_class">. You could do something like this:

>>> soup = BeautifulSoup('''<li class="li_class">
... <a href="#" title="myurl">URL Text</a>
... <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
... <a href="#" title="myurl3">URL Text3</a></li>''')  # just some sample html
>>> for elem in soup.findAll("li", "li_class"):
...     pprint(elem.findAll('a'))  # requires `from pprint import pprint`
...
[<a href="#" title="myurl">URL Text</a>, <a href="#" title="myurl2">URL Text2</a>]

Soup recipe:

Download the one file required.
Place the downloaded file in your site-packages directory or similar.
Enjoy your soup.
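
If installing a third-party parser is not an option, roughly the same filtering can be sketched with the standard library's html.parser module. This is only an illustration, not the BeautifulSoup API: the class name AnchorsInLi and the helper anchors_in_li_class are hypothetical, and the sketch assumes anchors contain plain text only (nested tags inside an anchor are dropped from the output).

```python
from html.parser import HTMLParser


class AnchorsInLi(HTMLParser):
    """Collect <a>...</a> markup found inside <li> elements of a given class.

    Hypothetical helper for illustration; assumes anchors hold plain text only.
    """

    def __init__(self, li_class):
        super().__init__()
        self.li_class = li_class
        self.li_stack = []   # one bool per open <li>: does it carry li_class?
        self.current = None  # pieces of the anchor being read, or None
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            classes = (dict(attrs).get("class") or "").split()
            self.li_stack.append(self.li_class in classes)
        elif tag == "a" and any(self.li_stack):
            # get_starttag_text() returns the raw text of the tag just opened
            self.current = [self.get_starttag_text()]

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.anchors.append("".join(self.current) + "</a>")
            self.current = None
        elif tag == "li" and self.li_stack:
            self.li_stack.pop()


def anchors_in_li_class(html, li_class):
    parser = AnchorsInLi(li_class)
    parser.feed(html)
    parser.close()
    return parser.anchors


sample = (
    '<li class="li_class"> <a href="#" title="myurl">URL Text</a> '
    '<a href="#" title="myurl2">URL Text2</a></li>'
    '<li class="foo"> <a href="#" title="myurl3">URL Text3</a></li>'
)
print(anchors_in_li_class(sample, "li_class"))
```

Because the parser tracks open li elements on a stack, the attribute order and spacing of the li tag do not matter here, which is the main thing a string or regex match would get wrong.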

OK, let's say I want to find only the a tags that are inside of <li class="li_class">. So, if the li tag doesn't have that class, I don't want to return the a tag. How do I do that?

Corey Goldberg · Accepted Answer · 2010-01-21 03:04:34Z

You really shouldn't use regexes to parse HTML... ever.

Try BeautifulSoup or lxml.

But... you asked. So a quick and naive version might look like this:

import re

html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>
"""

# naive: greedy match from "<a " to the last ">" on the same line
m = re.search('(<a .*>)', html)
if m:
    print(m.group(1))

I can think of a lot of ways this would break.

Considering what he wants to get back, you probably want something more like /(<a .*?</a>)/. And yes, it breaks on pretty much everything.
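
A minimal sketch of the non-greedy idea from the comment above, for anyone who still wants to try it. The exact pattern is an assumption rather than anything from the answer: the `\b`, the `[^>]*` and the IGNORECASE/DOTALL flags are additions so that uppercase tags and anchors spanning several lines are matched, and find_anchors is a hypothetical helper name.

```python
import re

# Non-greedy: stop at the first </a> after each opening <a ...> tag.
# DOTALL lets "." cross newlines; IGNORECASE accepts <A HREF=...>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def find_anchors(html):
    """Return the markup of every <a>...</a> pair found in html."""
    return ANCHOR_RE.findall(html)


html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>
"""
print(find_anchors(html))
```

It still breaks on plenty of legal HTML, for example a ">" inside an attribute value or an anchor hidden in a comment or script block, so treat it as a quick hack for markup you control.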

ghostdog74 · Accepted Answer · 2010-01-21 03:37:48Z

You can try this since your requirement is simple. No need for BeautifulSoup or regex.

>>> s = """
... <ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>
... """
>>> for item in s.split("</a>"):
...     if "<a href=" in item:
...         print(item[item.find("<a href="):] + "</a>")
...
<a href="#" title="myurl">URL Text</a>

You can include a check of '<li class="li_class">' in the if statement as desired.
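
One possible reading of that suggestion, sketched under assumptions: testing for the li tag inside the same if statement would only catch the first anchor after the opening tag, so this version carries a flag from one split piece to the next. The function name anchors_in_li is hypothetical, and the sketch relies on the li tag being written exactly as '<li class="li_class">' with no nesting.

```python
def anchors_in_li(html, li_open='<li class="li_class">'):
    """Return '<a href=...>...</a>' strings that follow an exact li_open tag.

    Plain string matching only: any change in attribute order, quoting or
    spacing of the <li> or <a> tags will defeat it.
    """
    found = []
    inside = False  # are we currently after li_open and before its </li>?
    for piece in html.split("</a>"):
        start = piece.find("<a href=")
        prefix = piece if start == -1 else piece[:start]
        # Update the flag from the markup before the anchor:
        # whichever of li_open / </li> appears last wins.
        last_open = prefix.rfind(li_open)
        last_close = prefix.rfind("</li>")
        if last_open != -1 or last_close != -1:
            inside = last_open > last_close
        if start != -1 and inside:
            found.append(piece[start:] + "</a>")
    return found


sample = (
    '<li class="li_class"> <a href="#" title="myurl">URL Text</a> '
    '<a href="#" title="myurl2">URL Text2</a></li>'
    '<li class="foo"> <a href="#" title="myurl3">URL Text3</a></li>'
)
for anchor in anchors_in_li(sample):
    print(anchor)
```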

And of course lots of perfectly correct ways to write that HTML (even just switching the title and href attributes, for example!) will make this go down in flames. What a perfectly terrible "solution"!
I think you all should not jump too far ahead. What OP wants to do is supposedly very simple. You guys make it too complicated!
