
I am working on a project that involves fetching pages with cURL or file_get_contents. The problem is that when I echo the fetched HTML, the output looks different from the original page: not all images show up. Is there a solution? My code:

    <?php
    //Get the url
    $url = "http://www.google.com";

    //Get the html of url
    function get_data($url)
    {
        $ch = curl_init();
        $timeout = 5;
        //$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.X.Y.Z Safari/525.13.";
        $userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }

    $html = file_get_contents($url);
    echo $html;
    ?>

Thanks

  • Assuming you're not actually fetching Google, you're not providing enough information to help. Give the actual page fetched and show what you expect vs. what you get (a small sample, not the entire page). Commented Aug 19, 2010 at 3:57
  • Are the images that don't show up referenced by relative locations? /images/blah.jpg wouldn't show up if rendered locally, but foo.com/images/blah.jpg would resolve to the correct image. Commented Aug 19, 2010 at 4:43

2 Answers

8

You should use <base> to specify a base url for all relative links:

If you curl http://example.com/thisPage.html, then add a base tag of <base href="http://example.com/" /> to your echoed output. Technically this should be in the <head>, but this will work:

    echo '<base href="http://example.com/" />';
    echo $html;

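To place the tag inside the <head> rather than in front of the whole document, something along these lines might work. This is only a sketch: add_base_tag is a hypothetical helper name, the sample markup is made up, and it assumes the fetched page has an ordinary opening <head> tag (it falls back to prepending the tag otherwise).

```php
<?php
// Hypothetical helper (not part of the original answer): inserts a <base>
// tag right after the first opening <head> tag, or prepends it to the
// document when no <head> tag is found.
function add_base_tag($html, $baseUrl)
{
    $base = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '" />';

    // Match <head> with optional attributes, case-insensitively, and only
    // the first occurrence. A callback is used so that characters in the
    // URL are never interpreted as regex replacement references.
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function ($m) use ($base) {
            return $m[0] . $base;
        },
        $html,
        1,
        $count
    );

    // No <head> found: prepend the tag, as the echo-based approach does.
    return ($count > 0) ? $result : $base . $html;
}

// Usage with a made-up page:
$html = '<html><HEAD lang="en"><title>t</title></HEAD>'
      . '<body><img src="images/blah.jpg"></body></html>';
echo add_base_tag($html, 'http://example.com/');
```

Using a regular expression keeps the sketch short; a page with unusual markup may need a real HTML parser instead.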


2 Comments

Brilliant - much better than rewriting all links manually. Here's an easy way to put the <base> into the right place: $response = preg_replace("/<head>/i", "<head><base href='$url' />", $response, 1);
This solved an issue that took me 3 hours to find! Thank you!
1

Use this

    <?php
    //Get the url
    $url = "http://www.google.com";

    //Get the html of url
    function get_data($url)
    {
        $ch = curl_init();
        $timeout = 5;
        //$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.X.Y.Z Safari/525.13.";
        $userAgent = "IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }

    //Fetch the page (the snippet as posted never assigned $page)
    $page = get_data($url);

    //Build the base url from the host and the directory part of the path
    $parse = parse_url($url);
    $path = isset($parse['path']) ? $parse['path'] : '/';
    $dir = rtrim(str_replace('\\', '/', dirname($path)), '/');
    $count = "http://" . $parse['host'] . $dir . "/";

    //Insert <base> right after the opening <head> tag (lower or upper case)
    $page = str_replace("<head>", "<head>\n<base href=\"" . $count . "\" />", $page);
    $page = str_replace("<HEAD>", "<head>\n<base href=\"" . $count . "\" />", $page);
    echo $page;
    ?>
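The base URL computation can also be pulled into a small function that keeps the scheme and port of the fetched URL instead of hard-coding "http://". The following is a rough sketch under that assumption; base_url_for is a hypothetical name, and the example URLs are illustrative only.

```php
<?php
// Hypothetical helper: derives a base URL (scheme + host + port + directory)
// from the URL of the fetched page, for use in a <base href="..."> tag.
function base_url_for($url)
{
    $parts  = parse_url($url);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path   = isset($parts['path']) ? $parts['path'] : '/';

    // Keep everything up to and including the last slash of the path,
    // so "/dir/page.html" becomes "/dir/" and "/dir/" stays "/dir/".
    $pos = strrpos($path, '/');
    $dir = ($pos === false) ? '/' : substr($path, 0, $pos + 1);

    return $scheme . '://' . $host . $port . $dir;
}

// Usage with illustrative URLs:
echo base_url_for('http://example.com/thisPage.html'), "\n";
echo base_url_for('https://example.com:8080/a/b/page.php?x=1'), "\n";
```

Cutting at the last slash, rather than calling dirname(), avoids dropping the final directory when the URL itself ends in a slash.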
