Extract data from HTML

Question

I have an HTML document with the following structure:

    <!DOCTYPE html>
    <html>
    <body>
    <p>One</p>
    <p>Two</p>
    <p>Three</p>
    </body>
    </html>

Please recommend a Python module that would let me write something like this:

    var = ModuleName.html.body.p2
    print(var)  # Two

Use BeautifulSoup with CSS selectors, or lxml.

– Learner, Commented Nov 24, 2015 at 16:00

alecxe · Accepted Answer · 2015-11-24 16:04:41Z

BeautifulSoup would make it quite close to what you are asking about:

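    from bs4 import BeautifulSoup

    # data holds the HTML document from the question
    data = "<!DOCTYPE html> <html> <body> <p>One</p> <p>Two</p> <p>Three</p> </body> </html>"
    soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
    print(soup.html.body("p")[1].text)  # prints Two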

In other words, the dot notation is a shortcut for find(), and calling a tag with parentheses is a shortcut for find_all().
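
To make the two shortcuts concrete, here is a minimal sketch that writes each one next to its long form. The variable names are only illustrative, and the html.parser backend is an assumption (any installed parser should behave the same for this document).

```python
from bs4 import BeautifulSoup

data = "<!DOCTYPE html> <html> <body> <p>One</p> <p>Two</p> <p>Three</p> </body> </html>"
soup = BeautifulSoup(data, "html.parser")

# Attribute access returns the first matching descendant, like find().
first_via_dot = soup.body.p
first_via_find = soup.find("body").find("p")

# Calling a tag returns every matching descendant, like find_all().
all_via_call = soup.body("p")
all_via_find_all = soup.body.find_all("p")

# Index into the list of matches to reach the second paragraph.
second_text = all_via_find_all[1].get_text()
print(second_text)
```

The long forms are more verbose but easier to search for in the documentation, so they may be the better choice in code that other people have to read.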

Paul K. · Accepted Answer · 2015-11-24 16:18:19Z

I would recommend using BeautifulSoup to parse your HTML and extracting the content you want with CSS selectors.

You can find an example very similar to what you want to do in the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Edit: Here is a code snippet, since the documentation has a typo and omits the ":" in the selector string.

    from bs4 import BeautifulSoup

    data = "<!DOCTYPE html> <html> <body><p>One</p><p>Two</p><p>Three</p></body></html>"
    soup = BeautifulSoup(data, 'html.parser')
    # select() returns a list of matching tags
    print(soup.body.select("p:nth-of-type(2)"))  # prints [<p>Two</p>]
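
Since select() returns a list of tags rather than the text itself, a small follow-up sketch may help if you only want the string "Two". It assumes a BeautifulSoup version that provides select_one(); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

data = "<!DOCTYPE html><html><body><p>One</p><p>Two</p><p>Three</p></body></html>"
soup = BeautifulSoup(data, "html.parser")

# select() always returns a list, even when only one element matches.
matches = soup.body.select("p:nth-of-type(2)")

# select_one() returns the first match, or None if nothing matches.
second = soup.select_one("body > p:nth-of-type(2)")
second_text = second.get_text() if second is not None else None
print(second_text)
```

The None check matters when the selector might not match, for example when the page has fewer than two paragraphs.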
