python beautifulsoup adding extra end tags

Question

I'm using BeautifulSoup to parse a website:

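 import urllib2
 import BeautifulSoup

 # Fetch the page and hand the response to BeautifulSoup 3 for parsing
 request = urllib2.Request(url)
 response = urllib2.urlopen(request)
 soup = BeautifulSoup.BeautifulSoup(response)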

I am using it to traverse a table. The problem is that BeautifulSoup inserts an extra end tag for the table that does not exist in the original HTML, which I verified with print soup.prettify(). As a result, one of the td tags ends up outside the table and I can't select it.
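Since prettify() shows the tree after the parser has repaired it, it can help to count the tags in the raw page text as well. The following is a minimal sketch, assuming Python 3 and only the standard library (the question itself uses Python 2); TagCounter, count_tag, and the sample string are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start and end tags of one tag name exactly as they appear in the source."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.closed += 1


def count_tag(html, tag="table"):
    """Return (number of start tags, number of end tags) found in the raw HTML."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.opened, counter.closed


# Hypothetical page fragment with one table closed twice
raw = "<table><tr><td>a</td></tr></table><td>b</td></table>"
print(count_tag(raw))  # (1, 2)
```

If the two counts differ in the raw source, the stray end tag comes from the page rather than from the parser.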

Can you post the essential structure of the HTML code that is not working?
– Federico A. Ramponi, Commented Aug 17, 2010 at 17:14
I would love an answer to this as well. In my case, it seems BS is adding tags that are not in the page's source code.
– Hartley Brody, Commented Apr 8, 2012 at 23:12

ebt · Accepted Answer · 2010-08-17 17:25:08Z

1

How about searching directly for each tag instead of trying to traverse into the table?

 # findAll() returns every matching td; find() would return only the first one
 for td in soup.findAll("td"): ...

It's not unusual for a tbody tag to be inserted into a table automatically even when it is not in the source. You can either account for it in your code or jump straight to the tr or td tags.
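Searching for cells directly might look like the sketch below. It assumes Beautiful Soup 4 (the bs4 package) on Python 3 with the built-in "html.parser" backend, not the 3.x release used in the question; SAMPLE_HTML and all_cell_texts are hypothetical names made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the last cell sits outside the table, as it would
# if the table had been closed too early.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>first</td><td>second</td></tr>
  </tbody>
</table>
<td>stray</td>
"""


def all_cell_texts(html):
    """Return the text of every td in the document, wherever it is nested."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() searches the whole tree, so it does not matter whether a
    # cell is under tbody, directly under tr, or outside the table entirely.
    return [td.get_text(strip=True) for td in soup.find_all("td")]


print(all_cell_texts(SAMPLE_HTML))  # ['first', 'second', 'stray']
```

Because the search ignores the table structure, the cell that fell outside the table is still returned.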


2 Comments

imns Over a year ago

That's a good thought and I tried that. When I run the code above it returns the whole table, not each individual td. I think BS is breaking on this page's horrible HTML ... not sure what to do about it though.

ebt Over a year ago

Two things: first, check the version you're using. If you're on 3.1, switch back to 3.0 (crummy.com/software/BeautifulSoup/3.1-problems.html). Otherwise, try lxml; IMHO it's a better general parser than BeautifulSoup.
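For the lxml route, a minimal sketch might look like this. It assumes the third-party lxml package on Python 3; cell_texts and the sample string are hypothetical and only illustrate the idea, and how lxml repairs a particular broken page still has to be checked against that page.

```python
import lxml.html


def cell_texts(html):
    """Return the stripped text of every td in the document, in document order."""
    doc = lxml.html.fromstring(html)
    # "//td" matches td elements at any depth, so the lookup does not depend
    # on how the parser nested the cells.
    return [td.text_content().strip() for td in doc.xpath("//td")]


# Hypothetical, well-formed sample page
sample = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"
print(cell_texts(sample))
```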
