I want to extract meaningful text out of an html document and I was using html-agility-pack for the same. Here is my code:
string convertedContent = HttpUtility.HtmlDecode( ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)) ); ConvertHtml:
public string ConvertHtml(string html) { HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); StringWriter sw = new StringWriter(); ConvertTo(doc.DocumentNode, sw); sw.Flush(); return sw.ToString(); } ConvertTo:
public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText) { string html; switch (node.NodeType) { case HtmlAgilityPack.HtmlNodeType.Comment: // don't output comments break; case HtmlAgilityPack.HtmlNodeType.Document: foreach (HtmlNode subnode in node.ChildNodes) { ConvertTo(subnode, outText); } break; case HtmlAgilityPack.HtmlNodeType.Text: // script and style must not be output string parentName = node.ParentNode.Name; if ((parentName == "script") || (parentName == "style")) break; // get text html = ((HtmlTextNode)node).Text; // is it in fact a special closing node output as text? if (HtmlNode.IsOverlappedClosingElement(html)) break; // check the text is meaningful and not a bunch of whitespaces if (html.Trim().Length > 0) { outText.Write(HtmlEntity.DeEntitize(html) + " "); } break; case HtmlAgilityPack.HtmlNodeType.Element: switch (node.Name) { case "p": // treat paragraphs as crlf outText.Write("\r\n"); break; } if (node.HasChildNodes) { foreach (HtmlNode subnode in node.ChildNodes) { ConvertTo(subnode, outText); } } break; } } Now in some cases when the html pages are malformed (for example the following page - http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta-tag like <meta content="text/html; charset=uft-8" http-equiv="Content-Type">) [Note "uft" instead of utf] my code is puking at the time I am trying to load the html document.
Can someone suggest me how can I overcome these malformed html pages and still extract relevant text out of a html document?
Thanks, Kapil