0

I need a regex in Python to find a link's HTML within a larger chunk of HTML.

so if I have:

<ul class="something">
    <li id="li_id">
        <a href="#" title="myurl">URL Text</a>
    </li>
</ul>

I would get back:

<a href="#" title="myurl">URL Text</a> 

I'd like to do it with a regex and not BeautifulSoup or something similar. Does anyone have a snippet lying around that I could use for this?

Thanks

4
  • 4
    "I'd like to do it with a regex and not beautifulsoup or something similar to that." Enjoy pounding that screw with a hammer. Commented Jan 21, 2010 at 2:57
  • 2
    Seriously: DON'T use regular expressions to parse HTML. Just don't. stackoverflow.com/questions/1732348/… Commented Jan 21, 2010 at 2:59
  • 1
    Why would you like to do it with a regex and not beautifulsoup or something similar to that? Commented Jan 21, 2010 at 3:02
  • @OP, yes you can use regex, if your task is simple. Commented Jan 21, 2010 at 3:12

3 Answers 3

4

Soup is good for you:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<ul class="something">
...     <li id="li_id">
...         <a href="#" title="myurl">URL Text</a>
...     </li>
... </ul>''')

There are many arguments you can pass to the findAll method; more here. The one line below will get you started by returning a list of all links matching some conditions.

>>> soup.findAll(href='#', title='myurl')
[<a href="#" title="myurl">URL Text</a>]

Edit: based on OP's comment, added info included:

So let's say you're interested only in <a> tags within list elements of a certain class, <li class="li_class">. You could do something like this:

>>> from pprint import pprint
>>> soup = BeautifulSoup('''<li class="li_class"> <a href="#" title="myurl">URL Text</a> <a href="#" title="myurl2">URL Text2</a></li><li class="foo"> <a href="#" title="myurl3">URL Text3</a></li>''')  # just some sample html
>>> for elem in soup.findAll("li", "li_class"):
...     pprint(elem.findAll('a'))
...
[<a href="#" title="myurl">URL Text</a>, <a href="#" title="myurl2">URL Text2</a>]

Soup recipe:

  1. Download the one file required.
  2. Place dl'd file in site-packages dir or similar.
  3. Enjoy your soup.

1 Comment

OK, let's say I want to find only the a tags that are inside of <li class="li_class">. So if the li tag doesn't have that class, I don't want to return the a tag. How do I do that?
3

You really shouldn't use regexes to parse HTML... ever.

Try BeautifulSoup or lxml.

But... you asked. So a quick and naive version might look like this:

import re

html = """
<ul class="something">
    <li id="li_id">
        <a href="#" title="myurl">URL Text</a>
    </li>
</ul>
"""

# naive: greedy match from "<a " to the last ">" on the same line
m = re.search('(<a .*>)', html)
if m:
    print(m.group(1))

I can think of a lot of ways this would break.

1 Comment

Considering what he wants to get back, you probably want something more like /(<a .*?</a>)/. And yes, it breaks on pretty much everything.
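A minimal sketch of the non-greedy variant suggested in this comment, using only the standard `re` module. The `LINK_RE` name and the choice of the `DOTALL` and `IGNORECASE` flags are assumptions added for illustration, not part of the original answer, and the pattern still breaks on nested tags, comments, or a literal `</a>` inside an attribute value.

```python
import re

html = """
<ul class="something">
    <li id="li_id">
        <a href="#" title="myurl">URL Text</a>
    </li>
</ul>
"""

# Non-greedy match from "<a" plus whitespace up to the nearest "</a>".
# DOTALL lets "." cross newlines; IGNORECASE also accepts <A ...>...</A>.
LINK_RE = re.compile(r"<a\s.*?</a>", re.DOTALL | re.IGNORECASE)

links = LINK_RE.findall(html)
print(links)  # ['<a href="#" title="myurl">URL Text</a>']
```

`findall` returns every match, so several links in the same document come back as separate list items rather than one long greedy match.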
1

You can try this since your requirement is simple. No need for BeautifulSoup or regex.

>>> s = """
... <ul class="something">
...     <li id="li_id">
...         <a href="#" title="myurl">URL Text</a>
...     </li>
... </ul>
... """
>>> for item in s.split("</a>"):
...     if "<a href=" in item:
...         print(item[item.find("<a href="):] + "</a>")
...
<a href="#" title="myurl">URL Text</a>

You can include a check of '<li class="li_class">' in the if statement as desired.
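One possible way to add that check, sketched with plain string methods. A bare `'<li class="li_class">' in item` test would only catch the first link in each list item, because later chunks no longer contain the `<li ...>` opener, so this version remembers the most recent `<li` it has seen. The function name `links_in_li` and its `li_open` parameter are hypothetical, and the matching is as literal as the original: different attribute order, extra whitespace, or a tag such as `<link>` will confuse it.

```python
def links_in_li(html, li_open='<li class="li_class">'):
    """Collect <a href=...>...</a> snippets that follow the given <li> opener."""
    found = []
    inside = False  # True while the most recent "<li" seen is the target opener
    for chunk in html.split("</a>"):
        li_pos = chunk.rfind("<li")
        if li_pos != -1:
            # A new list item starts in this chunk; check whether it is the target.
            inside = chunk.startswith(li_open, li_pos)
        # Only look for the link after the list item opener, if there is one.
        a_pos = chunk.find("<a href=", max(li_pos, 0))
        if inside and a_pos != -1:
            found.append(chunk[a_pos:] + "</a>")
    return found


sample = (
    '<li class="li_class"> <a href="#" title="myurl">URL Text</a> '
    '<a href="#" title="myurl2">URL Text2</a></li>'
    '<li class="foo"> <a href="#" title="myurl3">URL Text3</a></li>'
)
result = links_in_li(sample)
for link in result:
    print(link)
```

With the sample HTML from the BeautifulSoup answer, this keeps the two links inside `<li class="li_class">` and skips the one inside `<li class="foo">`.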

2 Comments

And of course lots of perfectly correct ways to write that HTML (even just switching the title and href attributes, for example!) will make this go down in flames. What a perfectly terrible "solution"!
I think you all should not jump too far ahead. What OP wants to do is supposedly very simple. You guys make it too complicated!
