I often use XPath with PHP to parse pages, but this time I don't understand how it behaves on one specific page with the code below. I hope you can help me with this.

This is the code I use to parse the page http://www.jeuxvideo.com/recherche.php?m=9&t=10&q=Call+of+duty :

    <?php
    $What = 'Call of duty';
    $What = urlencode($What);
    $Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Query);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    $response = curl_exec($ch);
    curl_close($ch);

    /*
    $search = array("<article", "</article>");
    $replace = array("<div", "</div>");
    $response = str_replace($search, $replace, $response);
    */

    $dom = new DOMDocument();
    @$dom->loadHTML($response);
    $xpath = new DOMXPath($dom);

    $elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/a');
    //$elements = $xpath->query('//div[@class="recherche-aphabetique-item"]/a');

    count($elements);
    var_dump($elements);
    ?>

Fiddle to test it: http://phpfiddle.org/main/code/r9n6-d0j0

I just want to get all "a" nodes that are inside "article" nodes with the class "recherche-aphabetique-item".

But it returns nothing.

As you can see in the commented-out code, I tried replacing the HTML5 article elements with div elements, but I got the same behavior.

Thanks for your help.

1 Answer

I'm seeing lots of "DOMDocument::loadHTML(): Unexpected end tag" errors, so you should perhaps use libxml's internal error handling functions to deal with them. Also, when I looked at the DOM of the remote site, I could not see any a tags that would match the XPath query, only span tags:

    <?php
    $What = 'Call of duty';
    $What = urlencode($What);
    $Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Query);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    $response = curl_exec($ch);
    curl_close($ch);

    /* try to suppress errors using libxml */
    libxml_use_internal_errors( true );

    $dom = new DOMDocument();
    /* additional flags for DOMDocument */
    $dom->validateOnParse=false;
    $dom->standalone=true;
    $dom->strictErrorChecking=false;
    $dom->recover=true;
    $dom->formatOutput=false;

    @$dom->loadHTML($response);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');

    count( $elements );
    var_dump( $elements );
    ?>

Output:

object(DOMNodeList)#97 (1) { ["length"]=> int(94) } 

You could further simplify this perhaps by trying:

    $What = 'Call of duty';
    $What = urlencode($What);
    $Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

    libxml_use_internal_errors( true );

    $dom = new DOMDocument();
    $dom->validateOnParse=false;
    $dom->standalone=true;
    $dom->strictErrorChecking=false;
    $dom->recover=true;
    $dom->formatOutput=false;

    @$dom->loadHTMLFile($Query);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');

    count($elements);
    foreach( $elements as $node )echo $node->nodeValue,'<br />';
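
If you want to see the parse errors rather than only silence them, libxml's internal error buffer can be read before it is cleared. The following is a minimal sketch, not the code that was run against the site: the $html string is a made-up, deliberately malformed sample standing in for the downloaded page, and the exact messages reported depend on the libxml2 version installed.

```php
<?php
// Made-up sample markup (an assumption, not the real page): note the stray </div>.
$html = '<html><body>'
      . '<article class="recherche-aphabetique-item"><span>Call of Duty</span></article>'
      . '</div></body></html>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with (among others) line and message properties.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

// The document is still usable after recovery, so the XPath query works as before.
$xpath = new DOMXPath($dom);
$spans = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
echo $spans->length, " span(s) found\n";
```

Printing the collected errors this way makes it easier to tell harmless markup complaints apart from damage that actually changes the parsed tree.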

4 Comments

Thanks a lot, it works as expected with your code. Maybe the end tag errors were the problem? One thing I don't understand: the tags are "a" in the browser, but they are replaced with "span" when we download the HTML file...
The replacement of the "a" tags with "span" prevents me from getting the href link.
Looking at the site again, this time in Chrome rather than Firefox (where I had JavaScript disabled), the a tags are clearly present. That suggests to me that the a tags are generated by JavaScript, which presents a problem when scraping data.
Yes, they are replaced by JavaScript, but I found a way to decode the links, which are stored encoded in the span class (for example: 1F4D43C3C51F4D43C31E24232C23261F). I'm currently trying to rewrite the code from JS to PHP.
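
For anyone attempting the same JS-to-PHP port, here is a rough sketch of what such a decoder might look like. It is not taken from this thread: the function name decodeObfuscatedLink is hypothetical, and the 16-character substitution alphabet is an assumption about how the site encodes its links (each pair of characters is treated as two base-16 digits of one byte, using a shuffled digit order). It does decode the example token above to a plausible path, but verify it against the site's own script before relying on it.

```php
<?php
// Hypothetical helper, not from the thread.
// ASSUMPTION: the site uses this shuffled base-16 alphabet; check the page's JS to confirm.
function decodeObfuscatedLink(string $encoded): string
{
    $alphabet = '0A12B34C56D78E9F';
    $decoded  = '';

    // Read the token two characters at a time: high digit, then low digit.
    for ($i = 0; $i + 1 < strlen($encoded); $i += 2) {
        $high = strpos($alphabet, $encoded[$i]);
        $low  = strpos($alphabet, $encoded[$i + 1]);

        if ($high === false || $low === false) {
            return ''; // character outside the assumed alphabet: not a valid token
        }
        $decoded .= chr($high * 16 + $low);
    }

    return $decoded;
}

// Token quoted in the comment above.
echo decodeObfuscatedLink('1F4D43C3C51F4D43C31E24232C23261F'), "\n"; // prints "/jeux/jeu-65759/"
```

To use it with the XPath result, read each span's class with $node->getAttribute('class') and pass the encoded token to the function. Which token in the class attribute carries the link is also an assumption to check against the actual markup.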
