Problem parsing with beautifulsoup

Question

I'm trying to parse the web page at the URL in the code below:

    import urllib2
    import sys
    from BeautifulSoup import BeautifulSoup

    url = 'http://www.etsy.com/teams/list'
    source = urllib2.urlopen(url)
    soup = BeautifulSoup(source)
    print soup.prettify()
    print len(soup('h3'))  # print the number of occurrences of h3
    h3s = soup.findAll('h3')  # find the same elements as above
    print len(h3s)

The problem is that it prints 1, while the web page contains at least 10 h3 elements. I couldn't figure out where the problem lies. I am using Python 2.7 and BeautifulSoup 3.0.7.

For the record, BeautifulSoup 3.2.0 gives me 12 h3s with your code (the last two are in some locale-setting nagging overlay).
– rczajka, Commented Aug 31, 2011 at 21:24

Zach Kelling · Accepted Answer · 2011-08-31 21:13:59Z

2

I'd recommend using lxml instead:

    >>> import lxml.html
    >>> doc = lxml.html.parse('http://www.etsy.com/teams/list')
    >>> len(doc.xpath('//h3'))
    10
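If you want to try the same idea without hitting the network, here is a minimal sketch that parses an HTML string with lxml and counts the h3 elements. The `SAMPLE_HTML` string and the `count_h3` helper are made up for illustration (they are not the real Etsy page), and the sample deliberately leaves some tags unclosed, which is the kind of markup that can trip up a stricter parser.

```python
import lxml.html

# Hypothetical stand-in for the downloaded page (not the real Etsy markup).
# The <p> and the inner <div> are left unclosed on purpose.
SAMPLE_HTML = """
<html><body>
  <div id="teams">
    <h3>Team A</h3>
    <p>unclosed paragraph
    <h3>Team B</h3>
    <div><h3>Team C</h3>
  </div>
</body></html>
"""


def count_h3(html_text):
    """Parse an HTML string with lxml and return the number of <h3> elements."""
    doc = lxml.html.fromstring(html_text)
    return len(doc.xpath('//h3'))


if __name__ == "__main__":
    print(count_h3(SAMPLE_HTML))
```

For a live page you would pass the downloaded source to `count_h3` instead of the sample string.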


2 Comments

baskar_p Over a year ago

Thank you. I will try using lxml. Do you have any idea why BeautifulSoup doesn't give the proper result in this case?

Zach Kelling Over a year ago

No, afaik that should work. All I could suggest is trying a different version of BeautifulSoup, or preferably using lxml instead.
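One way to check whether a particular parser version is dropping tags is to count the start tags with something that does not build a tree at all. The sketch below is only an assumption about how you might do that: it uses the Python 3 standard library `html.parser` module (on Python 2.7 the module is named `HTMLParser`), and `TagCounter` and `count_tag` are hypothetical names, not part of BeautifulSoup or lxml.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name, without building a tree."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == self.tag:
            self.count += 1


def count_tag(html_text, tag):
    """Return how many times <tag> is opened in html_text."""
    counter = TagCounter(tag.lower())
    counter.feed(html_text)
    counter.close()
    return counter.count


if __name__ == "__main__":
    # Hypothetical snippet with unclosed tags, standing in for the real page.
    snippet = "<h3>a</h3><div><h3>b</h3><p>c"
    print(count_tag(snippet, "h3"))
```

If this count and the count from your parsing library disagree on the same source, the library's tree building is the likely place to look.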
