
edited Jan 21, 2019 at 19:46


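While a lot of people have suggested using a regex to strip HTML tags, that approach has several downsides: entities such as &nbsp; are left undecoded, and text from adjacent blocks gets glued together.

For example: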

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world I love you
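To make the downside concrete, here is a small sketch of what a bare tag-stripping regex does with that input. The variable names are only illustrative.

```python
import re

sample = "<p>hello&nbsp;world</p>I love you"

# Naive approach: delete anything that looks like a tag and nothing else.
naive = re.sub(r"<.*?>", "", sample)

# The entity is still there, and "world" runs straight into "I".
print(naive)  # hello&nbsp;worldI love you
```

The tags are gone, but the result is still not readable text, which is why the entity decoding and block handling matter.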

Here's a snippet I came up with. You can customize it to your specific needs, and it has worked well for me:

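    import html
    import re


    def html2text(htm):
        # Decode entities such as &nbsp; and &amp; into real characters
        ret = html.unescape(htm)
        # Map typographic characters to plain ASCII equivalents
        ret = ret.translate({
            8209: ord('-'),  # non-breaking hyphen
            8220: ord('"'),  # left double quotation mark
            8221: ord('"'),  # right double quotation mark
            160: ord(' '),   # non-breaking space
        })
        # Turn every whitespace character (including newlines) into a space
        ret = re.sub(r"\s", " ", ret)
        # Turn <br> and block-closing tags into newlines
        ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
        # Replace any remaining tag with a space
        ret = re.sub(r"<.*?>", " ", ret, flags=re.DOTALL)
        # Squeeze runs of spaces into one
        ret = re.sub(r" +", " ", ret)
        return ret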
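As for customizing it, one possible variation is sketched below. It assumes you want the character map and the list of "line-breaking" tags to be parameters, and that you want the padding around each line trimmed. The names html_to_text, DEFAULT_CHAR_MAP and DEFAULT_BREAK_TAGS are hypothetical, and the extra handling of <br/> and </li> is my assumption about what a typical tweak looks like, not part of the snippet above.

```python
import html
import re

# Hypothetical defaults; extend them to suit your own markup.
DEFAULT_CHAR_MAP = {
    0x2011: "-",  # non-breaking hyphen
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x00A0: " ",  # non-breaking space
}
# <br>, <br/>, <br /> and a few closing block tags count as line breaks.
DEFAULT_BREAK_TAGS = r"<br\s*/?>|</(?:p|div|li|h\d)>"


def html_to_text(markup, char_map=DEFAULT_CHAR_MAP, break_tags=DEFAULT_BREAK_TAGS):
    text = html.unescape(markup)          # decode entities
    text = text.translate(char_map)       # normalize special characters
    text = re.sub(r"\s", " ", text)       # flatten source whitespace
    text = re.sub(break_tags, "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<.*?>", " ", text, flags=re.DOTALL)  # drop other tags
    text = re.sub(r" +", " ", text)       # squeeze repeated spaces
    # Trim the padding left behind around each line.
    return "\n".join(line.strip() for line in text.split("\n")).strip()


print(repr(html_to_text("<p>hello&nbsp;world</p>I love you")))
# 'hello world\nI love you'
```

Note that the closing </p> becomes a newline here, so the two pieces of text end up on separate lines rather than on one line separated by a space.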


answered Jan 21, 2019 at 19:30

Uri Goren
