Possible Duplicate:
Extracting text from HTML file using Python

What is the best way in Python to extract text from HTML pages the same way a browser does when you copy and paste?

2 Answers

BeautifulSoup is a popular option for reading and parsing HTML pages.

1 Comment

Dang. What easy points, @Makoto! :D

The question that monkut references doesn't give any Python solution to the exact problem. While BeautifulSoup and lxml can both be used to parse HTML, there is still a big step from a parsed document to text that approximates the formatting embedded in the HTML.
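
To illustrate that gap, here is a rough sketch that uses only the standard library's html.parser (swapped in here for BeautifulSoup or lxml so the example has no dependencies). The names TextExtractor and html_to_text, and the BLOCK_TAGS set, are assumptions made for this example. It only turns a handful of block-level tags into line breaks, which is far from what a browser does with tables, lists, or CSS.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of tags treated as line-breaking.
# A browser's notion of block-level layout is much richer than this.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Content inside these tags is never shown as page text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and inserts newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling html_to_text on a page gives plain lines of text, but column alignment, list markers, and emphasis are all lost, which is why the external tools below are still worth a look.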

To do this, I have resorted to non-Python solutions (which I've blogged about, but I will resist linking here, as I'm not sure of the SO etiquette). If you are on a *nix system, you can install this html2text package from Germany. It can be installed easily on macOS with Homebrew ($ brew install html2text) or MacPorts ($ sudo port install html2text), and on other *nix systems through their package managers. It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html

You can also install a text-based browser (e.g. w3m) and use it to produce formatted text from HTML with the following command-line syntax: w3m filename.html -dump > file.txt

You can, of course, access these solutions from Python using the subprocess module or the popular envoy wrapper for subprocess.
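
As a minimal sketch of the subprocess route, the function below pipes an HTML string to an external converter and returns whatever the tool prints. It assumes w3m is installed and on the PATH, and it feeds the HTML on standard input using w3m's -T text/html option instead of passing a filename as in the command above. The names render_html and W3M_CMD are made up for this example.

```python
import subprocess

# Assumption: w3m is installed. "-dump" prints the rendered page to stdout,
# and "-T text/html" tells w3m to treat its standard input as HTML.
W3M_CMD = ("w3m", "-dump", "-T", "text/html")


def render_html(html, cmd=W3M_CMD):
    """Send an HTML string to an external converter and return its stdout."""
    result = subprocess.run(
        list(cmd),
        input=html,           # written to the child's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Any converter that reads HTML on standard input can be substituted by passing a different cmd sequence.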

Even after all this effort, you may find that some important information (e.g. <u> tags) is not handled the way you would like, but these are the best options I have found so far.
