Grabbing HTML From a Page That Has Blocked CURL

Question

I have been asked to grab a certain line from a page, but it appears that the site has blocked cURL requests.

The site in question is http://www.habbo.com/home/Intricat

I tried changing the User-Agent to see if they were blocking on that, but it didn't do the trick.

The code I am using is as follows:

<?php
$curl_handle = curl_init();
// This is the URL you would like the content grabbed from
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0");
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.habbo.com/home/Intricat');
// This is the amount of time in seconds until it times out; this is useful if the
// server you are requesting data from is down. This way you can offer a "sorry page".
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$buffer = curl_exec($curl_handle);
// This keeps everything running smoothly
curl_close($curl_handle);
// Change the message below as you wish; please keep in mind you must have your message within the " " quotes.
if (empty($buffer)) {
    print "Sorry, It seems our weather resources are currently unavailable, please check back later.";
} else {
    print $buffer;
}
?>

Any ideas on another way I can grab a line of code from that page if they've blocked CURL requests?

EDIT: On running curl -i through my server, it appears that the site is setting a cookie first?

"our weather resources"? - I'm pretty sure you meant the weather resources of habbo.com, right? — hakre
– hakre, Commented Nov 2, 2012 at 16:41
Just seeing, it's a browser game. Looking for cheats? I'm pretty sure they made it that way for a reason. If you really want to fiddle with it, you will have to learn some more of the basics, I'd say ;)
– hakre, Commented Nov 2, 2012 at 16:44
Nothing to do with cheats. I'm grabbing someone's motto off their homepage.
– Tenatious, Commented Nov 2, 2012 at 16:49

hakre · Accepted Answer · 2012-11-02 16:38:37Z

You are not very specific about the kind of block you're talking about. The website in question, http://www.habbo.com/home/Intricat, first of all checks whether the browser has JavaScript enabled:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
function setCookie(c_name, value, expiredays) {
    var exdate = new Date();
    exdate.setDate(exdate.getDate() + expiredays);
    document.cookie = c_name + "=" + escape(value) +
        ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
}
function getHostUri() {
    var loc = document.location;
    return loc.toString();
}
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10);
setCookie('DOAReferrer', document.referrer, 10);
location.href = getHostUri();
</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>
</body>
</html>

As curl has no JavaScript support, you either need to use an HTTP client that does, or you need to mimic that script yourself: create the cookie and make the new request to the same URI on your own.

You mimic that by reading the JavaScript code and understanding what it does. You then translate that knowledge into PHP code and curl request configuration. In other words, you do the work the JavaScript would do in the browser, only in PHP and in a form compatible with curl. You might need to parse the HTML and the JavaScript. For HTML parsing I highly suggest PHP's DOMDocument. The first lesson is to extract the text of the <script> tag here.
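As a rough sketch of the steps described above, the following PHP pulls the inline script out of the challenge page with DOMDocument, reads the setCookie(...) calls with a regular expression, and builds a cookie string for a second request. The function names (extractScriptText, parseSetCookieCalls, buildCookieHeader, fetchWithCookies) are made up for illustration. The regex only handles quoted literal arguments, so a call such as setCookie('DOAReferrer', document.referrer, 10) is skipped, and the real page may change its script at any time.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from a library.

// Return the concatenated text of all <script> elements in an HTML string.
function extractScriptText(string $html): string
{
    if ($html === '') {
        return '';
    }
    $doc = new DOMDocument();
    // Suppress warnings caused by imperfect real-world markup.
    @$doc->loadHTML($html);
    $text = '';
    foreach ($doc->getElementsByTagName('script') as $script) {
        $text .= $script->textContent . "\n";
    }
    return $text;
}

// Collect name => value pairs from setCookie('name', 'value', ...) calls.
// Assumption: both arguments are quoted string literals.
function parseSetCookieCalls(string $js): array
{
    $cookies = [];
    $pattern = '/setCookie\(\s*([\'"])(.*?)\1\s*,\s*([\'"])(.*?)\3\s*,/';
    preg_match_all($pattern, $js, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $cookies[$m[2]] = $m[4];
    }
    return $cookies;
}

// Build the "name=value; name2=value2" string that CURLOPT_COOKIE expects.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        // The page's script applies escape() to the value; rawurlencode()
        // is used here as an approximate stand-in, not an exact equivalent.
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return implode('; ', $pairs);
}

// Repeat the request with the cookies the script would have set.
function fetchWithCookies(string $url, string $cookieHeader)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    if ($cookieHeader !== '') {
        curl_setopt($ch, CURLOPT_COOKIE, $cookieHeader);
    }
    return curl_exec($ch);
}
```

A typical flow would be to fetch the page once, pass the body through extractScriptText and parseSetCookieCalls, and then call fetchWithCookies on the same URL with the result of buildCookieHeader. Whether the site accepts the request without the DOAReferrer cookie is not something this sketch can confirm.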
Surely if I do a CURL request now, I should get that script tag returned? Instead, the page just constantly loads?
First sentence with question mark: Yes. Second sentence with question mark: No.

Yoni Hassin · Accepted Answer · 2012-11-02 16:37:32Z

Go in with your browser and copy the exact headers that are being sent. The site won't be able to tell that you are using curl, because the request will look exactly the same. If cookies are used, attach them as headers.
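One way to apply this, sketched under assumptions: paste the request headers from the browser's developer tools into a string, convert them to the array format that CURLOPT_HTTPHEADER expects, and send them with curl. The helper names headersFromRawBlock and fetchWithHeaders are hypothetical. The list of skipped headers is a judgment call, not a rule.

```php
<?php
// Hypothetical helper: convert a raw header block copied from the browser
// into an array of "Name: value" strings for CURLOPT_HTTPHEADER.
function headersFromRawBlock(string $raw): array
{
    // Headers left for curl to manage. Accept-Encoding is dropped because
    // sending it by hand can return a compressed body that curl does not
    // decode unless CURLOPT_ENCODING is also set.
    $skip = ['host', 'content-length', 'connection', 'accept-encoding'];
    $headers = [];
    foreach (preg_split('/\r\n|\r|\n/', trim($raw)) as $line) {
        $line = trim($line);
        // Skip blank lines, HTTP/2 pseudo-headers (":authority" etc.)
        // and the request line (e.g. "GET /path HTTP/1.1").
        if ($line === '' || $line[0] === ':'
            || preg_match('#^[A-Z]+ \S+ HTTP/[\d.]+$#', $line)) {
            continue;
        }
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $name = trim($parts[0]);
        $value = trim($parts[1]);
        if (in_array(strtolower($name), $skip, true)) {
            continue;
        }
        $headers[] = $name . ': ' . $value;
    }
    return $headers;
}

// Send a GET request with the copied headers, including any Cookie header.
function fetchWithHeaders(string $url, array $headers)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    return curl_exec($ch);
}
```

Cookies copied this way are only valid for as long as the site honours them, so the header block may need to be refreshed from the browser from time to time.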

Waygood · Accepted Answer · 2012-11-02 16:38:04Z

This is a cut and paste from my Curl class I did quite a few years back, hope you can pick some gems out of it for yourself.

function get_url($url)
{
    curl_setopt($this->ch, CURLOPT_URL, $url);
    curl_setopt($this->ch, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name);
    curl_setopt($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name);
    if (!is_null($this->referer)) {
        curl_setopt($this->ch, CURLOPT_REFERER, $this->referer);
    }
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($this->ch, CURLOPT_HEADER, 0);
    if ($this->follow) {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
    } else {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 0);
    }
    curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*.*"));
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE); // this line makes it work under https

    $try = 0;
    $result = "";
    // force a retry up to 5 times
    while (($try <= $this->retry_attempts) && (empty($result))) {
        $try++;
        $result = curl_exec($this->ch);
        $this->response = curl_getinfo($this->ch); // $response['http_code'] 4xx is an error
    }

    // set referring URL to current url for next page.
    if ($this->referer_to_last) {
        $this->set_referer($url);
    }
    return $result;
}

Set $cookie_name="./cookie"; and ensure your script has write access to the directory you choose.
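The write-access point can be checked in code before the cookie file is handed to curl. The sketch below is an assumption about how one might do it, not part of the class above: writableCookieJar and applyCookieJar are hypothetical names, and falling back to the system temp directory is just one possible choice.

```php
<?php
// Hypothetical helper: return a cookie-jar path inside a directory the
// script can write to, falling back to the system temp directory.
function writableCookieJar(string $preferredDir, string $fileName = 'cookie'): string
{
    $dir = (is_dir($preferredDir) && is_writable($preferredDir))
        ? $preferredDir
        : sys_get_temp_dir();
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $fileName;
}

// Point curl at the same file for reading and writing cookies, so cookies
// set by one response are sent with later requests on this handle.
function applyCookieJar($ch, string $path): void
{
    curl_setopt($ch, CURLOPT_COOKIEFILE, $path);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $path);
}
```

With this in place, $cookie_name could be set from writableCookieJar('.') instead of a hard-coded "./cookie".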

PavoDive · Accepted Answer · 2015-08-10 22:20:34Z

I know this is a very old post, but since I had to answer the same question for myself today, I'm sharing it here in case it is of use to people who come across it. I'm also fully aware the OP asked for curl specifically, but, just like me, there could be people interested in any solution, curl or not.

The page I wanted to get with curl blocked it. If the block is not because of JavaScript but because of the user agent (that was my case, and setting the agent in curl didn't help), then wget could be a solution:

wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"
