I'm using BeautifulSoup to parse a website:

 import urllib2
 import BeautifulSoup

 # Fetch the page and parse the response with BeautifulSoup 3
 request = urllib2.Request(url)
 response = urllib2.urlopen(request)
 soup = BeautifulSoup.BeautifulSoup(response)

I am using it to traverse a table. The problem is that BS inserts an extra closing tag for the table that doesn't exist in the original HTML, which I verified with print soup.prettify(). As a result, one of the td tags ends up outside the table and I can't select it.

  • Can you post the essential structure of the HTML that is not working? Commented Aug 17, 2010 at 17:14
  • I would love an answer to this as well. In my case, it seems BS is adding tags that are not in the page's source code. Commented Apr 8, 2012 at 23:12

1 Answer

How about searching directly for each tag instead of trying to traverse into the table?

 for td in soup.findAll("td"):
     ...

It's not unusual to find a tbody tag nested within a table automatically, even when it's not in the code. You can either code for it or just jump straight to the tr or td tags.
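If the parsed tree itself is the problem, one way to apply the same "go straight to the td tags" idea is to skip the tree entirely. The sketch below is not BeautifulSoup: it uses Python 3's standard-library html.parser, which only reports tags as events, so a stray or missing table end tag cannot hide a cell. The CellCollector class and the sample markup are hypothetical, made up for illustration, and it assumes you only need the text of each cell.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, regardless of how tables are nested or closed."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell texts, in document order
        self._parts = None   # text fragments of the cell being read, or None

    def _finish_cell(self):
        # Store the current cell, if one is open, and reset the buffer.
        if self._parts is not None:
            self.cells.append("".join(self._parts).strip())
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._finish_cell()   # a new <td> implicitly closes an unclosed one
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._finish_cell()

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._parts is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._finish_cell()   # flush a cell left open at end of input


# Hypothetical malformed markup: a stray </table> and an unclosed last cell.
sample = "<table><tr><td>a</td></table><td>b</td></tr><td>c"
collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.cells)  # ['a', 'b', 'c']
```

The trade-off is that you lose tree navigation (no parent or sibling lookups), so this only fits cases where a flat list of cells is enough.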

2 Comments

That's a good thought, and I tried that. When I run the code above, it returns the whole table, not each individual td. I think BS is breaking on this page's horrible HTML ... not sure what to do about it, though.
Two things: check the version you're using. If you're using 3.1, switch back to 3.0 (crummy.com/software/BeautifulSoup/3.1-problems.html); otherwise try lxml. IMHO it's a better general parser than Soup.
