Parsing HTML using HTMLAgilityPack

Question

I have the following HTML that I'm trying to parse using the HTML Agility Pack.

This is a snippet of the whole file that is returned by the code:

    <div class="story-body fnt-13 p20-b user-gen">
        <p>text here text here text </p>
        <p>text here text here text text here text here text text here text here text text here text here text </p>
        <div class="gallery clr bdr aln-c js-no-shadow mod cld">
            <div>
                <ol>
                    <li class="fader-item aln-c ">
                        <div class="imageWrap m10-b">
                            &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                        </div>
                        <p class="caption">caption text</p>
                    </li>
                </ol>
            </div>
        </div >
        <p>text here text here text text here text here text text here text here text text here text here text </p>
        <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
        <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    </div>

I get this snippet using the following code (which is messy, I know):

    string url = "http://www.domain.com/story.html";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);

    var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        // .SelectMany(div => div.Descendants("p"))
        .ToList();
    int cn = links.Count;

    HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        textBox1.AppendText(node.InnerText.Trim());
        textBox1.AppendText(System.Environment.NewLine);
    }

The code loops through each p and (for now) appends it to a textbox. Everything works correctly except for the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld: from that bit of HTML I also get the &#8203; and caption text bits in the output.

What's the best way to omit that from the results?

Psst... So two questions: what's the best way to omit that from the results? That's one question; what is the other?
– Chad, commented Nov 28, 2011 at 19:56

Simon Mourier · Accepted Answer · 2011-11-28 21:54:30Z

3

XPath is your friend. Try this and forget about that crappy XLinq-style syntax :-)

    HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        Console.WriteLine(node.InnerText.Trim());
    }

This expression selects all p nodes that don't have any attributes set, so <p class="caption"> is skipped. See here for other samples: XPath Syntax


7 Comments

Nathan Over a year ago

Thanks, that works a treat. I'll look into XPath too, as it looks like a far better solution!

Nathan Over a year ago

It did work, but it also picks up other p nodes from elsewhere on the page. What I posted at the top was only a snippet.

Simon Mourier Over a year ago

Just add other filters to the expression (what's between the [ and ] characters)

Nathan Over a year ago

With XPath, is there a way to get a collection of nodes from within a certain div with a certain class, i.e. <div class="story-body fnt-13 p20-b user-gen">?

Simon Mourier Over a year ago

Sure. SelectNodes("//div[@class='story']") will get all div elements in the document whose 'class' attribute is exactly 'story'.
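To narrow the result to the story container from the question, the filters discussed in this thread can be combined into one XPath expression. What follows is only a sketch under stated assumptions: it assumes the container can be recognised by the substring story-body in its class attribute, it parses a shortened inline copy of the question's HTML with LoadHtml instead of fetching the real URL, and the names StoryXPathSketch, SampleHtml and GetStoryParagraphs are made up for illustration. Because @class='...' compares the whole attribute value, the sketch uses contains() to cope with the multi-class value.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class StoryXPathSketch
{
    // Shortened stand-in for the page in the question (assumption: same nesting).
    public const string SampleHtml =
        "<html><body><p>unrelated paragraph</p>" +
        "<div class=\"story-body fnt-13 p20-b user-gen\">" +
        "<p>first </p>" +
        "<div class=\"gallery clr bdr aln-c js-no-shadow mod cld\"><div><ol><li>" +
        "<p class=\"caption\">caption text</p>" +
        "</li></ol></div></div>" +
        "<p>second </p>" +
        "</div></body></html>";

    // Hypothetical helper: returns the text of the story's own paragraphs.
    public static List<string> GetStoryParagraphs(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // contains(@class, ...) tolerates extra class names on the div;
        // "/p" takes only direct children of that div, and [not(@*)]
        // additionally drops any p that carries attributes.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//div[contains(@class, 'story-body')]/p[not(@*)]");

        var result = new List<string>();
        if (nodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        foreach (string text in GetStoryParagraphs(SampleHtml))
        {
            Console.WriteLine(text);
        }
    }
}
```

With the sample above, the unrelated paragraph and the caption should both be left out, since neither is a direct child of the story div. Note that contains() is a plain substring test, so it would also match a class such as story-body-wide.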


Jim Mischel · Accepted Answer · 2011-11-28 20:12:18Z

1

It's not quite clear what you're asking. I think you're asking how to get just the direct descendants of a particular div. If that's the case, then use ChildNodes rather than Descendants. That is:

    .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants does a fully recursive walk of the document tree.
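The difference can be seen side by side in a small sketch. It is not the asker's real page: it assumes a shortened inline copy of the HTML loaded with LoadHtml, and the names ChildNodesSketch, StoryDivs, AllParagraphs and DirectParagraphs are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class ChildNodesSketch
{
    // Shortened stand-in for the HTML in the question (assumption).
    public const string SampleHtml =
        "<div class=\"story-body fnt-13 p20-b user-gen\">" +
        "<p>first</p>" +
        "<div class=\"gallery\"><p class=\"caption\">caption text</p></div>" +
        "<p>second</p>" +
        "</div>";

    // Same div filter as in the question.
    static IEnumerable<HtmlNode> StoryDivs(HtmlDocument document)
    {
        return document.DocumentNode
            .Descendants("div")
            .Where(div => div.GetAttributeValue("class", "")
                             .Contains("story-body fnt-13 p20-b user-gen"));
    }

    // Recursive walk: also returns the nested <p class="caption">.
    public static List<string> AllParagraphs(HtmlDocument document)
    {
        return StoryDivs(document)
            .SelectMany(div => div.Descendants("p"))
            .Select(p => p.InnerText.Trim())
            .ToList();
    }

    // Direct children only: the caption inside the gallery div is skipped.
    public static List<string> DirectParagraphs(HtmlDocument document)
    {
        return StoryDivs(document)
            .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))
            .Select(p => p.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        Console.WriteLine(string.Join(" | ", AllParagraphs(document)));
        Console.WriteLine(string.Join(" | ", DirectParagraphs(document)));
    }
}
```

HtmlNode also has an Elements("p") method that returns direct children by name, so div.Elements("p") should work as a shorthand for the ChildNodes filter.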


4 Comments

maxlego Over a year ago

Easier would be with XPath: //p

Nathan Over a year ago

That would include <p class="caption">caption text</p>. I'm trying not to include anything from the 4th line to the 15th line (the divs), just the other <p>'s.

Jim Mischel Over a year ago

@Nathan: No, I don't think it would include those. ChildNodes only gets the direct descendants of a particular node. If you replace the SelectMany in your LINQ expression with my SelectMany, I think you'll find that it will work as advertised. My expression uses Where because there is no ChildNodes overload that lets you specify the type (i.e. you can't say, ChildNodes("p")).

Nathan Over a year ago

OK, I think I understand what you're saying. Like the following?

    var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        // .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))
        .ToList();
