scraping with curl

Question

I am trying to scrape some information from websites using PHP cURL. The problem is that it returns different content than I get when opening the same page in a normal browser.

The example site is this: http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453

I am trying to get the meta tags. In the browser they come back as:

<meta name="title" content="Razmere v Preboldu se umirjajo" />
<meta name="description" content="Za prebivalci Prebolda je nemirna no&#269;, ki ji je sledilo jutro s &#353;e dodatnimi padavinami..." />
<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453" />

But my cURL request gets this:

<title>VECER.COM: </title>
<meta name="title" content="" />
<meta name="description" content="" />
<link rel="image_src" href="-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=1899123000000000">

Here is my code:

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

What am I doing wrong?

I have no idea. I just copied the code from the user agent line through the referer line from other samples, and nothing seemed to work.
– mire, Commented Jan 28, 2013 at 13:28
Web servers sometimes send different replies depending on your user agent, but I'm just guessing; your problem could be something else entirely.
– dutt, Commented Jan 28, 2013 at 15:32
I tried the same user agent as my browser (Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0) and it didn't work. It might have something to do with the cookies, but I have no clue whether the cookie.txt setting is even working. I also tried setting a redirect option to true, with no luck.
– mire, Commented Jan 28, 2013 at 15:37
My problem was that I was using set_value('url') from CodeIgniter, which for security purposes encoded the special characters in the URL. All is solved now. I also recommend using the Googlebot as the user agent.
– mire, Commented Jan 28, 2013 at 22:50
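The fix described in the last comment can be sketched roughly as follows. This assumes the encoding in question was HTML-entity escaping (for example `&` turned into `&amp;`), which would explain why the server never saw a proper `id` parameter and fell back to a default one. The helper name `normalize_url` is hypothetical and not part of CodeIgniter or cURL.

```php
<?php
// Hypothetical helper: undo HTML-entity escaping on a URL before
// handing it to cURL. Assumes the URL was escaped for HTML output.
function normalize_url(string $url): string
{
    return html_entity_decode($url, ENT_QUOTES, 'UTF-8');
}

// The URL from the question, as it might look after being escaped for a form field.
$escaped = 'http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&amp;id=2010091905576453';
$clean   = normalize_url($escaped);

// Compare the query parameters a server would parse in each case.
parse_str((string) parse_url($escaped, PHP_URL_QUERY), $before);
parse_str((string) parse_url($clean, PHP_URL_QUERY), $after);

// In the escaped URL the second parameter is no longer named "id".
var_dump(isset($before['id']));
var_dump($after['id']);
```

Passing `$clean` rather than `$escaped` to the `curl()` function from the question should then request the same page the browser does.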

Vipin Singh · Accepted Answer · 2014-06-19 14:19:19Z

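For scraping meta tags and any other attributes, you can use http://simplehtmldom.sourceforge.net/. The example below instead uses cURL together with PHP's built-in DOMDocument and DOMXPath to list the links on a page: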

$target_url = "http://stackoverflow.com/questions";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument (@ suppresses warnings about malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    //storeLink($url,$target_url);
    echo "<br />Link stored: $url";
}
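Since the question was about meta tags rather than links, the same DOMXPath approach can be pointed at the page head. The following is a minimal sketch, not part of the original answer. The function name `extract_head_tags` is hypothetical, and the sample markup is an abbreviated copy of the tags quoted in the question, so that the snippet runs without a network request.

```php
<?php
// Hypothetical helper: collect <meta name="..."> contents and
// <link rel="..."> hrefs from an HTML string into one array.
function extract_head_tags(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tags = [];
    foreach ($xpath->query('//meta[@name]') as $node) {
        $tags[$node->getAttribute('name')] = $node->getAttribute('content');
    }
    foreach ($xpath->query('//link[@rel]') as $node) {
        $tags[$node->getAttribute('rel')] = $node->getAttribute('href');
    }
    return $tags;
}

// Sample markup based on the question; in practice this would be
// the body returned by curl_exec().
$html = '<html><head>'
      . '<meta name="title" content="Razmere v Preboldu se umirjajo" />'
      . '<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />'
      . '</head><body></body></html>';

$tags = extract_head_tags($html);
echo $tags['title'], "\n";
echo $tags['image_src'], "\n";
```

Note that the question's code sets CURLOPT_HEADER to true, which prepends the response headers to the returned string; it is probably simpler to leave that option off when the result is fed to a parser like this.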
