
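While a lot of people have mentioned using a regex to strip HTML tags, doing only that has downsides: entities such as &nbsp; are left undecoded, and the line breaks implied by block-level tags are lost.

For example: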

 <p>hello&nbsp;world</p>I love you
should be converted to:

 hello world
 I love you
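
To see the problem concretely, here is a minimal sketch of the tag-only approach. The name naive_strip is hypothetical, used just for this illustration:

```python
import re

def naive_strip(htm):
    # Hypothetical helper: delete anything that looks like a tag, nothing more.
    return re.sub(r"<.*?>", "", htm)

print(naive_strip("<p>hello&nbsp;world</p>I love you"))
# prints: hello&nbsp;worldI love you
```

The entity is still there, and the two pieces of text run together because the paragraph boundary was thrown away with the tag.
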
Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

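 import html
 import re

 def html2text(htm):
     # Collapse all source whitespace (including newlines) to single spaces.
     ret = re.sub(r"\s+", " ", htm)
     # Line breaks and closing block-level tags become newlines.
     ret = re.sub(r"<br\s*/?>|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
     # Replace every remaining tag with a space.
     ret = re.sub(r"<.*?>", " ", ret, flags=re.DOTALL)
     # Decode entities only after the tags are gone, so escaped text such as
     # "&lt;b&gt;" is not mistaken for a tag and stripped.
     ret = html.unescape(ret)
     # Map typographic characters to plain ASCII equivalents.
     ret = ret.translate({
         8209: "-",  # non-breaking hyphen
         8220: '"',  # left double quotation mark
         8221: '"',  # right double quotation mark
         160: " ",   # non-breaking space
     })
     # Collapse runs of spaces and trim the spaces around each line break.
     ret = re.sub(r" +", " ", ret)
     ret = re.sub(r" ?\n ?", "\n", ret)
     return ret.strip()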
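
If the regexes become hard to maintain, roughly the same behaviour can be sketched with the standard library's html.parser instead. This is a different technique from the snippet above, not a drop-in replacement. The names TextExtractor and html2text_parser are made up for this example, and the set of tags that trigger a line break is an assumption that mirrors the closing tags handled above:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Closing tags that end a line (assumed to match the regex version).
    BREAK_AFTER = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br /> via handle_startendtag.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BREAK_AFTER:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace; \s also matches the non-breaking space.
        self.parts.append(re.sub(r"\s+", " ", data))

def html2text_parser(htm):
    parser = TextExtractor()
    parser.feed(htm)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Trim the spaces around each line break and at both ends.
    return re.sub(r" ?\n ?", "\n", text).strip()

print(html2text_parser("<p>hello&nbsp;world</p>I love you"))
```

Because the parser decides what is a tag before any entity is decoded, escaped text such as a &lt;b&gt; c comes through as literal characters instead of being stripped.
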