I have the following HTML that I'm trying to parse using the HTML Agility Pack.

This is a snippet of the whole file that is returned by the code:

<div class="story-body fnt-13 p20-b user-gen">
  <p>text here text here text </p>
  <p>text here text here text text here text here text text here text here text text here text here text </p>
  <div class="gallery clr bdr aln-c js-no-shadow mod cld">
    <div>
      <ol>
        <li class="fader-item aln-c ">
          <div class="imageWrap m10-b">
            &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
          </div>
          <p class="caption">caption text</p>
        </li>
      </ol>
    </div>
  </div >
  <p>text here text here text text here text here text text here text here text text here text here text </p>
  <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
  <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>

I get this snippet using the following code (which is messy, I know):

string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var links = document.DocumentNode
    .Descendants("div")
    .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
    // .SelectMany(div => div.Descendants("p"))
    .ToList();
int cn = links.Count;

HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}

The code loops through each p and (for now) appends it to a text box. Everything works correctly except for the div with the class gallery clr bdr aln-c js-no-shadow mod cld: because of that bit of HTML, the output also contains the &#8203; entity and the caption text.

What's the best way to omit that from the results?

  • Psst...So two questions, what's the best way to omit that from the results? That's one question, what is the other? Commented Nov 28, 2011 at 19:56
  • I have no idea what you're talking about.... :p Commented Nov 28, 2011 at 20:00

2 Answers


XPath is your friend. Try this and forget about that crappy XLinq-style syntax :-)

HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (tl != null)
{
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        Console.WriteLine(node.InnerText.Trim());
    }
}

This expression will select all P nodes that don't have any attributes set. See here for other samples: XPath Syntax

7 Comments

Thanks, that works a treat. I will look into XPath too, as it looks like a far better solution!
It did work, but it also includes other P nodes from elsewhere on the page. What I posted at the top was only a snippet.
Just add other filters to the expression (what's between the [ and ] characters).
With XPath, is there a way to get a collection of nodes from within a certain div with a certain class, e.g. <div class="story-body fnt-13 p20-b user-gen">?
Sure. SelectNodes("//div[@class='story']") will get every div in the document whose 'class' attribute is exactly 'story'.
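To make the last two comments concrete, here is a minimal sketch of scoping the XPath to the story div and taking only its direct p children. The inline HTML string, the class name StoryParagraphs, and the choice of contains(@class, 'story-body') instead of an exact class match are assumptions made for illustration; they are not taken from the asker's real page.

```csharp
using System;
using HtmlAgilityPack;

class StoryParagraphs
{
    static void Main()
    {
        // Inline stand-in for the downloaded page (an assumption; the
        // question's code fetches the document with HtmlWeb.Load instead).
        const string html = @"
<div class=""story-body fnt-13 p20-b user-gen"">
  <p>first paragraph</p>
  <div class=""gallery clr bdr aln-c js-no-shadow mod cld"">
    <p class=""caption"">caption text</p>
  </div>
  <p>second paragraph</p>
</div>
<p>unrelated paragraph elsewhere on the page</p>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // "/p" is the child axis: it matches only <p> elements directly under
        // the selected div, so the caption nested in the gallery div is skipped.
        HtmlNodeCollection paragraphs = document.DocumentNode.SelectNodes(
            "//div[contains(@class, 'story-body')]/p");

        // SelectNodes returns null rather than an empty collection on no match.
        if (paragraphs == null)
        {
            return;
        }

        foreach (HtmlNode p in paragraphs)
        {
            Console.WriteLine(p.InnerText.Trim());
        }
    }
}
```

Note that contains() is a substring test, so it would also match a class such as 'story-body-wide'. If the class attribute is always exactly the same string, @class='story-body fnt-13 p20-b user-gen' is the stricter filter.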

It's not quite clear what you're asking. I think you're asking how to get just the direct children of a particular div. If that's the case, then use ChildNodes rather than Descendants. That is:

.SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants does a fully recursive walk of the document tree.
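The difference can be seen by counting matches on a small document. This is only a sketch: the inline HTML and the class name ChildNodesVersusDescendants are made up for illustration, and the shortened "story-body" class check is an assumption rather than the asker's full class string.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class ChildNodesVersusDescendants
{
    static void Main()
    {
        // Hypothetical miniature of the question's markup: two story
        // paragraphs plus one caption paragraph nested inside a gallery div.
        const string html =
            @"<div class=""story-body"">" +
            @"<p>kept</p>" +
            @"<div class=""gallery""><p class=""caption"">caption text</p></div>" +
            @"<p>also kept</p>" +
            @"</div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode story = document.DocumentNode
            .Descendants("div")
            .First(div => div.GetAttributeValue("class", "").Contains("story-body"));

        // Descendants walks the whole subtree, so it also finds the caption.
        int recursiveCount = story.Descendants("p").Count();

        // ChildNodes is a property holding only the immediate children.
        int directCount = story.ChildNodes.Count(n => n.Name == "p");

        Console.WriteLine($"Descendants(\"p\"): {recursiveCount}");
        Console.WriteLine($"ChildNodes filtered to p: {directCount}");
    }
}
```

The recursive count should come out one higher than the direct count, the extra match being the caption paragraph inside the gallery div.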

4 Comments

Easier would be with XPath: //p
That would include <p class="caption">caption text</p>. I'm trying not to include anything from the 4th line to the 15th line (the divs), just the other <p>'s.
@Nathan: No, I don't think it would include those. ChildNodes only gets the direct children of a particular node. If you replace the SelectMany in your LINQ expression with my SelectMany, I think you'll find that it will work as advertised. My expression uses Where because there is no ChildNodes overload that lets you specify the type (i.e. you can't say ChildNodes("p")).
OK, I think I understand what you're saying. Like the following? var links = document.DocumentNode .Descendants("div") .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen")) // .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p")) .ToList();
