
I'm trying to use XPath to extract HTML5 microdata from a page. Essentially I want to say "find nested nodes with an itemprop="name" attribute that are not nested inside another itemscope element (at any depth)". In the following example I want to find the name of the product (shoes) but not the brand name (Nike).

<div itemscope itemtype="http://schema.org/Product">
  <div itemscope itemtype="http://schema.org/Brand">
    <div itemprop="name">Nike</div> <!-- don't want this -->
  </div>
  <div itemprop="name">shoes</div> <!-- do want this -->
</div>

I can easily find the itemprop="name" elements with something like //*[@itemprop='name'], but this also pulls in the brand name. By the way, the elements shown in the example may be nested inside other tags, so I can't simply say "whose immediate parent does not have an itemscope attribute". I believe there may be something relating to ancestors that I can use, but I don't know enough about XPath. Any ideas?

6
  • In this example shoes is inside an itemscope, so to clarify, you want the names that have at most one itemscope ancestor, but not those that have more than one? Commented Oct 14, 2014 at 16:32
  • Or do you mean that for any given itemscope element X, extract all the names that are inside X but not also inside any other itemscope? Commented Oct 14, 2014 at 16:35
  • I'm using libxml2 (xmlsoft.org) via python. To answer your original questions actually either scenario would suffice in this context but I guess on balance the second scenario is probably closer. Commented Oct 14, 2014 at 16:39
  • libxml2 is limited to XPath 1.0 so the "at most one ancestor" approach is the best you can really do in a single XPath. Commented Oct 14, 2014 at 16:46
  • cool, that should work. Out of interest is it possible to do the "all the names that are inside X but not also inside any other itemscope" using multiple expressions? Commented Oct 14, 2014 at 16:48

2 Answers

4

A single expression to find all the itemprop="name" elements with at most one itemscope ancestor would be

//*[@itemprop = 'name'][not(ancestor::*[@itemscope][2])] 
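
To try the expression against the markup from the question, here is a minimal sketch. It assumes the lxml package (a Python binding built on libxml2, not necessarily the binding the asker is using) and its HTML parser, which accepts the valueless itemscope attribute. The names SNIPPET and top_level_names are made up for illustration.

```python
from lxml import html  # assumption: lxml is installed; it wraps libxml2

# Markup from the question, with the missing quote on the first itemtype restored.
SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <div itemscope itemtype="http://schema.org/Brand">
    <div itemprop="name">Nike</div>
  </div>
  <div itemprop="name">shoes</div>
</div>
"""

root = html.fromstring(SNIPPET)

# Keep only name elements that do NOT have a second itemscope ancestor,
# i.e. elements with at most one itemscope ancestor.
names = root.xpath("//*[@itemprop = 'name'][not(ancestor::*[@itemscope][2])]")

top_level_names = [el.text_content().strip() for el in names]
print(top_level_names)  # ['shoes']
```

The predicate [@itemscope] only tests that the attribute exists, so it matches whether the attribute has a value or not.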

If you wanted to start from one specific itemscope node and find the names that are nested specifically in it (and not a nested scope) then that's not something that you can do in one XPath 1.0 expression. You'd have to first extract its descendant names

.//*[@itemprop='name'] 

and then for each of those, find its nearest itemscope ancestor

ancestor::*[@itemscope][1] 

and check (on the Python side) whether or not that node is the same node as the one you started from. In XPath 2.0 you could do this in a single expression with

for $me in . return (.//*[@itemprop='name'][ancestor::*[@itemscope][1] is $me]) 

but 1.0 doesn't have the for $x in Y return Z structure for binding variables, or the is operator to compare node identity.
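
The two-step XPath 1.0 approach described above could look roughly like this in Python. This is a sketch under assumptions: it uses lxml as the libxml2 binding, and the helper name own_names is hypothetical. The identity check relies on lxml returning the same Python object for a given node while a reference to it is held.

```python
from lxml import html  # assumption: lxml is installed; it wraps libxml2

SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <div itemscope itemtype="http://schema.org/Brand">
    <div itemprop="name">Nike</div>
  </div>
  <div itemprop="name">shoes</div>
</div>
"""


def own_names(scope):
    """Return the names whose nearest itemscope ancestor is `scope` itself."""
    result = []
    # Step 1: every descendant name, including those in nested scopes.
    for el in scope.xpath(".//*[@itemprop='name']"):
        # Step 2: ancestor is a reverse axis, so [1] is the nearest match.
        nearest = el.xpath("ancestor::*[@itemscope][1]")
        # Step 3: compare node identity on the Python side.
        if nearest and nearest[0] is scope:
            result.append(el.text_content().strip())
    return result


root = html.fromstring(SNIPPET)
product = root.xpath("//*[@itemtype='http://schema.org/Product']")[0]
brand = root.xpath("//*[@itemtype='http://schema.org/Brand']")[0]

print(own_names(product))  # ['shoes']
print(own_names(brand))    # ['Nike']
```

Unlike the single "at most one ancestor" expression, this works for a scope at any nesting depth, at the cost of one extra XPath evaluation per candidate element.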


2 Comments

itemscope ancestor right? that's how i've understood your xpath
@TobyHobson Yes, sorry, I've fixed the typo.
1

Please give this a try:

//*[@itemprop = 'name' and not(ancestor::*[@itemscope][2])] 
