
I'm using BeautifulSoup 4 (imported as bs4) from the Anaconda distribution. Correct me if I'm wrong: as I understand it, BeautifulSoup is a library for transforming ill-formed HTML into well-formed HTML. But when I pass HTML to its constructor, I lose more than half of the characters. Shouldn't it only fix the HTML, not clean it? The docs don't describe this well.

This is the code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)

where html is HTML of Google's homepage.

Edit:

Could it be caused by the way I'm retrieving the HTML string, via str(soup)?

  • Can you add a little more code showing how you are getting the html? Also, does soup.prettify() look more in line with what you expect than str(soup), based on your edit? Lastly, can you try it with a simpler web page and post the before and after (assuming you can find something in line with the SO-recommended minimal, complete, verifiable example)? Commented Mar 12, 2015 at 2:23
  • I'm retrieving the HTML from a DOM that was sent to my MongoDB database. I just extract the JSON, read it in Python and convert it to a string. Yes, I'll try that with simpler websites, thanks for the advice. Commented Mar 12, 2015 at 2:30

1 Answer


First of all, make sure you see these "missing tags" in the html coming into BeautifulSoup to parse. It could be that the problem is not in how BeautifulSoup parses the HTML, but in how you are retrieving the HTML data to parse.
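One rough way to check this is to count the tags in the raw string and compare them with the tags BeautifulSoup ends up with. The snippet below is only a sketch: the helper names `tag_counts_raw` and `tag_counts_parsed` and the sample `html` string are made up for illustration, and the regex is a crude approximation of tag matching, not a real HTML tokenizer.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup


def tag_counts_raw(markup):
    """Approximate count of opening tags in the raw string (regex-based)."""
    names = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", markup)
    return Counter(name.lower() for name in names)


def tag_counts_parsed(markup, parser="html.parser"):
    """Count of tags present in the tree BeautifulSoup builds."""
    soup = BeautifulSoup(markup, parser)
    return Counter(tag.name for tag in soup.find_all(True))


# Hypothetical stand-in for the HTML string read from the database.
html = "<html><body><div><p>one<p>two</div><img src='a.png'></body></html>"

raw = tag_counts_raw(html)
parsed = tag_counts_parsed(html)

# Tags that appear in the input more often than in the parsed tree.
lost = raw - parsed
print("tags in input:", dict(raw))
print("tags after parsing:", dict(parsed))
print("lost:", dict(lost))
```

If `lost` is empty, the tags survived parsing and a difference in character count probably comes from how the tree is serialized (quoting, whitespace, entities) rather than from dropped content.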

I suspect you are downloading the Google homepage via urllib2 or requests and comparing what you see in str(soup) with what you see in a real browser. If this is the case, then you cannot compare the two: neither urllib2 nor requests is a browser, so they cannot execute JavaScript, manipulate the DOM after the page loads, or make asynchronous requests. What you get with urllib2 or requests is basically the initial HTML page "without a dynamic part".


If the problem is still in how BeautifulSoup parses the HTML...

As is clearly stated in the docs, the behavior depends on which parser BeautifulSoup chooses to use under the hood:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.

See Installing a parser and Specifying the parser to use.

Since you don't specify a parser explicitly, the following rule is applied:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

See also Differences between parsers.


In other words, try different parsers and see how the results differ:

soup = BeautifulSoup(html, 'lxml')
soup = BeautifulSoup(html, 'html5lib')
soup = BeautifulSoup(html, 'html.parser')
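To make that comparison concrete, here is a minimal sketch that runs the same markup through each parser and reports the length of the result. The `compare_parsers` helper and the sample `html` string are made up for illustration; lxml and html5lib are separate packages, so the sketch skips any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real document: deliberately malformed HTML.
html = "<p>Unclosed paragraph<a href='#'><img src='x.png'></a><b>bold"


def compare_parsers(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser_name: serialized_output} for every installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed.
            continue
        results[name] = str(soup)
    return results


outputs = compare_parsers(html)
for name, text in outputs.items():
    print(f"{name}: {len(text)} characters (input had {len(html)})")
    print(text)
```

Each parser repairs the broken markup in its own way, so the lengths will usually differ from each other and from the input even when nothing has been lost.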

6 Comments

No, I'm retrieving the HTML from a JSON attribute value and comparing both strings in a Python app. I tried lxml and I'm still getting 10k fewer characters. With html.parser I finally got more characters, but now when I try to parse the result as an ElementTree I get "XMLSyntaxError: Opening and ending tag mismatch: img line 2 and a, line 2, column 478", which means the HTML clean-up didn't do its job.
@Tommz thanks for the update, yeah ElementTree is not an option since it is HTML and not XML.
@Tommz could you also provide a reproducible example, or share the HTML you are currently dealing with and point out the parts that are missing after parsing? I have some ideas. Thanks.
Doesn't BeautifulSoup make HTML well-formed which is equivalent to XML rules?
@Tommz we are talking about different things, here you basically tried to parse a non-well-formed HTML with an XML parser. Hope that makes sense.