While a lot of people mentioned using regex to strip HTML tags, that approach has a lot of downsides.
For example:

    <p>hello world</p>I love you

should be parsed to:

    hello world
    I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:
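
    import html
    import re

    def html2text(htm):
        # Decode entities such as &amp; and &nbsp; into plain characters.
        ret = html.unescape(htm)
        # Map typographic characters to ASCII equivalents.
        ret = ret.translate({
            8209: ord('-'),   # non-breaking hyphen
            8220: ord('"'),   # left double quotation mark
            8221: ord('"'),   # right double quotation mark
            160: ord(' '),    # non-breaking space
        })
        # Turn every whitespace character (newlines included) into a space.
        ret = re.sub(r"\s", " ", ret)
        # Closing block-level tags and <br> become line breaks.
        ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
        # Any remaining tag becomes a single space.
        ret = re.sub(r"<.*?>", " ", ret, flags=re.DOTALL)
        # Squeeze runs of spaces into one.
        ret = re.sub(r" +", " ", ret)
        return ret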
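
If the regex version trips over your input (for example a `>` inside a quoted attribute value), the same idea can be sketched with the standard library's `html.parser.HTMLParser`. This is a different technique from the snippet above and only a rough sketch. The names `TextExtractor`, `BREAK_TAGS` and `html2text_parser` are made up for illustration, and the set of tags that produce a line break is an assumption that mirrors the regex version.

```python
from html.parser import HTMLParser
import re

# Tags that should produce a line break (assumption: same tags as the regex version).
BREAK_TAGS = {"br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes and inserts newlines at block boundaries."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the break is emitted when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Other block-level tags emit the break when they close.
        if tag in BREAK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace inside each text node to single spaces.
        self.parts.append(re.sub(r"\s+", " ", data))


def html2text_parser(htm):
    parser = TextExtractor()
    parser.feed(htm)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Trim each line and drop the empty ones.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


print(html2text_parser("<p>hello world</p>I love you"))
# hello world
# I love you
```

Unlike the regex version, this one drops empty lines and trims each line, so adjust the last two lines of `html2text_parser` if you need to keep blank lines.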