Extract text from HTML in Python [duplicate]

Question

Possible Duplicate:
Extracting text from HTML file using Python

What is the best way in Python to extract text from HTML pages the same way a browser does when you copy and paste?

Possible duplicate. I recommend this answer: stackoverflow.com/a/3987802/117092
– luc, Commented Jan 13, 2012 at 6:26

Makoto · Accepted Answer · 2012-01-13 02:19:46Z

5

BeautifulSoup is a popular option for reading and parsing HTML pages.
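As a minimal sketch of what that could look like (not code from the answer itself): the function name html_to_text is hypothetical, and the example assumes the third-party beautifulsoup4 package is installed and uses Python's built-in "html.parser" backend. It drops script and style elements and joins the remaining text nodes with newlines.

```python
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # lets this file load even when bs4 is missing


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line.

    html_to_text is a hypothetical helper name, not part of BeautifulSoup.
    """
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents a browser does not show as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Strip each text node and join the non-empty ones with newlines.
    return soup.get_text(separator="\n", strip=True)
```

Note that get_text() works on text nodes, not on rendered layout, so inline markup such as <b> ends up on its own line with this separator. The result is plain text, but it is not a faithful copy of what a browser would give you on copy-paste.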

1 Comment

yurisich Over a year ago

Dang. What easy points, @Makoto! :D

Community · Accepted Answer · 2017-05-23 12:24:33Z

The question that monkut references doesn't give a Python solution to this exact problem. While BeautifulSoup and lxml can both parse HTML, there is still a big step from a parsed document to text that approximates the formatting embedded in the HTML.

To do this, I have resorted to non-Python solutions (which I've blogged about, but I will resist linking here since I'm not sure of the SO etiquette). If you are on a *nix system, you can install this html2text package from Germany. It can be installed easily on macOS with Homebrew ($ brew install html2text) or MacPorts ($ sudo port install html2text), and on other *nix systems through their package managers. It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html

You can also install a text-based browser (e.g. w3m) and use it to produce formatted text from HTML with the following command-line syntax: w3m filename.html -dump > file.txt

You can, of course, access these solutions from Python using the subprocess module or the popular envoy wrapper for subprocess.
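A rough sketch of the subprocess route, assuming w3m is installed and on your PATH. The helper names w3m_dump_command and dump_html_with_w3m are made up for illustration; the only w3m option used is the -dump flag from the command line above.

```python
import shutil
import subprocess


def w3m_dump_command(html_path):
    """Build the argument list that asks w3m to render html_path as text."""
    return ["w3m", "-dump", html_path]


def dump_html_with_w3m(html_path):
    """Return the text that w3m renders for the HTML file at html_path.

    Raises RuntimeError if w3m is not on PATH, and
    subprocess.CalledProcessError if w3m exits with a non-zero status.
    """
    if shutil.which("w3m") is None:
        raise RuntimeError("w3m was not found on PATH")
    result = subprocess.run(
        w3m_dump_command(html_path),
        capture_output=True,  # collect stdout instead of redirecting to a file
        text=True,            # decode the output to str
        check=True,           # raise if w3m reports an error
    )
    return result.stdout
```

Calling dump_html_with_w3m("filename.html") should return roughly what the shell redirect writes to file.txt. The html2text command could be wrapped the same way by swapping in its own argument list.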

Even after all this effort, you may find that some important information (e.g. <u> tags) is not handled the way you would like, but these are the best current options I have found.
