HtmlAgilityPack giving problems with malformed html

Question

I want to extract meaningful text from an HTML document, and I am using html-agility-pack to do it. Here is my code:

string convertedContent = HttpUtility.HtmlDecode( ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)) );

ConvertHtml:

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

ConvertTo:

public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText);
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode)node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
            }
            break;
    }
}

Now, in some cases the HTML pages are malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"), and my code fails at the point where I try to load the HTML document.

Can someone suggest how I can handle these malformed HTML pages and still extract the relevant text from an HTML document?

Thanks, Kapil

PanJanek · Accepted Answer · 2010-05-31 15:03:55Z

As the HtmlAgilityPack project page says, "The parser is very tolerant with 'real world' malformed HTML". However, the kind of error you describe may be too serious for it to correct. You can set the default encoding with:

HtmlDocument doc = new HtmlDocument();
doc.OptionDefaultStreamEncoding = Encoding.UTF8;
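If you fetch the page bytes yourself, another option might be to resolve the declared charset defensively before decoding. The sketch below is only an illustration under that assumption: `CharsetFallback`, `Resolve`, and `Decode` are hypothetical names, not part of HtmlAgilityPack, and it relies only on `Encoding.GetEncoding` throwing `ArgumentException` for a name it does not recognise, such as "uft-8".

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class CharsetFallback
{
    // Resolve a declared charset name, returning the fallback when the
    // name is missing or not recognised (for example the typo "uft-8").
    public static Encoding Resolve(string declaredCharset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(declaredCharset))
            return fallback;

        try
        {
            return Encoding.GetEncoding(declaredCharset.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }

    // Decode raw page bytes, assuming UTF-8 when the declared charset is unusable.
    public static string Decode(byte[] rawHtml, string declaredCharset)
    {
        Encoding encoding = Resolve(declaredCharset, Encoding.UTF8);
        return encoding.GetString(rawHtml);
    }
}
```

The decoded string could then be passed to doc.LoadHtml(html) as in the question. Whether that alone avoids the failure depends on how your HtmlAgilityPack version treats the meta tag while parsing, which I have not verified.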
