Skip to main content
added 5 characters in body
Source Link
Uri Goren
  • 13.8k
  • 8
  • 62
  • 113

While alot of people mentioned using regex to strip html tags, there are a lot of downsides.

for example:

<p>hello&nbsp;world</p>I love you 

Should be parsed to:

Hello world I love you 

Here's a snippet I came up with, you can cusomize it to your specific needs, and it works like a charm

import re import html def html2text(htm): ret = html.unescape(htm) ret = ret.translate({ 8209: ord('-'), 8220: ord('"'), 8221: ord('"'), 160: ord(' '), }) ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE) ret = re.sub("<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE) ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL) ret = re.sub(r" +", " ", ret) return ret 

While alot of people mentioned using regex to strip html tags, there are a lot of downsides.

for example:

<p>hello&nbsp;world</p>I love you 

Should be parsed to:

Hello world I love you 

Here's a snippet I came up with, you can cusomize it to your specific needs, and it works like a charm

import re import html def html2text(htm): ret = html.unescape(htm) ret = ret.translate({ 8209: '-', 8220: ord('"'), 8221: ord('"'), 160: ord(' '), }) ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE) ret = re.sub("<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE) ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL) ret = re.sub(r" +", " ", ret) return ret 

While alot of people mentioned using regex to strip html tags, there are a lot of downsides.

for example:

<p>hello&nbsp;world</p>I love you 

Should be parsed to:

Hello world I love you 

Here's a snippet I came up with, you can cusomize it to your specific needs, and it works like a charm

import re import html def html2text(htm): ret = html.unescape(htm) ret = ret.translate({ 8209: ord('-'), 8220: ord('"'), 8221: ord('"'), 160: ord(' '), }) ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE) ret = re.sub("<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE) ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL) ret = re.sub(r" +", " ", ret) return ret 
Source Link
Uri Goren
  • 13.8k
  • 8
  • 62
  • 113

While alot of people mentioned using regex to strip html tags, there are a lot of downsides.

for example:

<p>hello&nbsp;world</p>I love you 

Should be parsed to:

Hello world I love you 

Here's a snippet I came up with, you can cusomize it to your specific needs, and it works like a charm

import re import html def html2text(htm): ret = html.unescape(htm) ret = ret.translate({ 8209: '-', 8220: ord('"'), 8221: ord('"'), 160: ord(' '), }) ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE) ret = re.sub("<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE) ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL) ret = re.sub(r" +", " ", ret) return ret