I have the following HTML that I'm trying to parse using the HTML Agility Pack.
This is a snippet of the whole file that is returned by the code:
    <div class="story-body fnt-13 p20-b user-gen">
      <p>text here text here text </p>
      <p>text here text here text text here text here text text here text here text text here text here text </p>
      <div class="gallery clr bdr aln-c js-no-shadow mod cld">
        <div>
          <ol>
            <li class="fader-item aln-c ">
              <div class="imageWrap m10-b">
                &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
              </div>
              <p class="caption">caption text</p>
            </li>
          </ol>
        </div>
      </div >
      <p>text here text here text text here text here text text here text here text text here text here text </p>
      <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
      <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    </div>

I get this snippet using the following code (which is messy, I know):
    string url = "http://www.domain.com/story.html";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);

    var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        // .SelectMany(div => div.Descendants("p"))
        .ToList();
    int cn = links.Count;

    HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        textBox1.AppendText(node.InnerText.Trim());
        textBox1.AppendText(System.Environment.NewLine);
    }

The code loops through each p and (for now) appends it to a textbox. Everything works correctly except for the div with the class gallery clr bdr aln-c js-no-shadow mod cld. From that part of the HTML, the output also contains the &#8203; and the "caption text" bits.
What's the best way to omit the gallery div's content from the results?
"So two questions, what's the best way to omit that from the results?" That's one question; what is the other?
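One possible way to leave the gallery out, as a minimal sketch rather than a definitive answer: instead of reading InnerText of the whole story div, read only the p elements that are its direct children. This assumes the story paragraphs are always direct children of the story-body div, as in the snippet above, so the caption p nested inside the gallery div is never visited. The names StoryText and ExtractParagraphs are hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class StoryText
{
    // Hypothetical helper: returns the text of the <p> elements that are
    // direct children of the story div. Anything nested deeper (such as the
    // gallery div and its caption) is skipped.
    public static List<string> ExtractParagraphs(HtmlDocument document)
    {
        // Assumption: the story container can be identified by the
        // "story-body" class. SelectSingleNode returns null when nothing matches.
        HtmlNode story = document.DocumentNode.SelectSingleNode(
            "//div[contains(@class, 'story-body')]");
        if (story == null)
        {
            return new List<string>();
        }

        // Elements("p") yields only first-generation children named "p".
        return story.Elements("p")
            .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }
}
```

With the document loaded as in the question, the loop could then become `foreach (string text in StoryText.ExtractParagraphs(document)) { textBox1.AppendText(text + Environment.NewLine); }`. If the paragraphs were not direct children, an alternative would be to select the gallery div and call Remove() on it before reading InnerText of the story div.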