3

I've downloaded this page via cURL, but the price shown on the page ($118.09) does not appear in the source that cURL returns. When I view the source of the same page in my browser (Chrome), the price is there. All the other product attributes are present in the cURL source (part number, description, case qty, etc.).

Any thoughts on what's happening?

Here are my cURL settings:

$options = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => false,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING       => "",
    CURLOPT_AUTOREFERER    => true,
    CURLOPT_CONNECTTIMEOUT => 10,
    CURLOPT_TIMEOUT        => 5,
    CURLOPT_MAXREDIRS      => 5,
    CURLOPT_USERAGENT      => "http://www.industrycortex.com/crawler.php"
);

NOTES:

It's been pointed out that this site does not display a price (see screenshot below) until the user visits /home. I've tested this and it is correct. The website sets a cookie that I was not passing with cURL. Further, the webserver tracks whether the user's session ID has visited /home, and only shows prices if it has. The cookie produced by a visit to /home is identical to the cookie produced by any other page.


[Screenshot; image not available]

2
  • Johnny Grabber's answer is likely correct (I regularly use the user-agent string to determine how content gets served). However, I just wanted to note that the link you provided (quickscrews.com/Part/4080XL) doesn't list any price. Are you sure you are cURLing the correct page? Commented Jan 3, 2012 at 22:44
  • Some sites don't want to be scraped. They use all sorts of devious tactics, including CSRF defense mechanisms. Commented Jan 3, 2012 at 22:48

4 Answers

2

The price seems to show after you access /home (without logging in) and come back. That's a strange protection mechanism, but it's easily circumvented. All you need is to do exactly that with your cURL session:

  1. Set CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR to the same file (I might be wrong about it being required, but it certainly won't harm).
  2. Set CURLOPT_URL to http://www.quickscrews.com/home and call curl_exec().
  3. Proceed with scraping.

The price should show now, unless the cookie is set with JS. In that case, you will have to read cookies from your browser and write them to CURLOPT_COOKIE.

P.S. I'm guessing the cookie is sawRegPg=sawit;. You can try just setting CURLOPT_COOKIE to that and see what happens.
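
For anyone working from the command line rather than PHP, the steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name fetch_with_session, the CURL and JAR variables, and the jar path cookies.txt are hypothetical names introduced for illustration. The -c and -b options are curl's command-line counterparts of CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE.

```shell
#!/bin/sh
# Sketch: prime the session by visiting one URL, then fetch the target page
# with the same cookie jar so the server sees the same session.

CURL="${CURL:-curl}"        # hypothetical override hook, e.g. for dry runs
JAR="${JAR:-cookies.txt}"   # hypothetical cookie-jar path

fetch_with_session() {
    prime_url=$1    # page that must be visited first (e.g. .../home)
    target_url=$2   # page to scrape afterwards

    # -c writes received cookies to the jar; -b sends cookies read from it.
    # The body of the priming request is discarded.
    "$CURL" -s -L -o /dev/null -c "$JAR" -b "$JAR" "$prime_url" || return 1

    # Same jar, so the session cookie from the first request is sent again.
    "$CURL" -s -L -c "$JAR" -b "$JAR" "$target_url"
}
```

A call such as fetch_with_session "http://www.quickscrews.com/home" "http://www.quickscrews.com/Part/4080XL" > part.html would then save the product page. Whether the price appears still depends on the cookie being set by an HTTP header rather than by JavaScript.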


1 Comment

In this particular case, it seems the webserver tracks if my session id has requested /home. Just having this data in my cookie-jar file does not solve the problem until I have visited /home. /home does not produce a unique or different cookie. Otherwise... this answer is of great help.
2

I've tried to cover this question a bit in section 14 of The Art Of Scripting HTTP Requests Using Curl document. Sites can do all sorts of checks and logic that behave differently for "plain" curl usage than for a browser.

Your work is then to record the browser session (with something like LiveHTTPHeaders or Firebug) and then make your curl command line mimic the browser session as closely as possible. That includes the user-agent, the referrer and, probably most of all, cookies.

1 Comment

This article is very helpful with this particular question.
0

Some sites render their pages differently for a browser and for a crawler. Have you tried setting a different user agent in cURL?

Edit: I can't see the price on the page. Perhaps you are logged in and can therefore see the price, while cURL (and I) are not logged in.

1 Comment

I just tried the "Windows NT on x86" and "Mac OS X on Intel x86 or x86_64" user agent strings from this page, and cURL still didn't return a price: developer.mozilla.org/en/Gecko_user_agent_string_reference
0

I ran into a site where they were sending content with gzip encoding, which cURL was not automatically decoding. Another thing that can help is to get the user-agent of your browser by visiting http://www.whatsmyuseragent.com/ and then using that as part of your command.

curl -A "USER_AGENT" "URL_YOU_NEED_TO_GET" | gzip -d > out.html

I understand that the issue in the particular case was with cookies, and probably not the command line curl, but I hit this issue when I was trying to figure out what I thought was the same thing and adding the gzip -d definitely fixed it for me.
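
One caveat with piping unconditionally through gzip -d is that it fails when the server sends an uncompressed body. curl's --compressed option requests a compressed response and decodes it automatically, which is usually the simpler fix (in PHP, the CURLOPT_ENCODING => "" setting from the question plays the same role). As a fallback, here is a small sketch that decompresses a saved response only when it starts with the gzip magic bytes; maybe_gunzip is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print a saved response, decompressing it only if it is gzip data.

maybe_gunzip() {
    file=$1

    # gzip streams start with the two bytes 0x1f 0x8b.
    magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')

    if [ "$magic" = "1f8b" ]; then
        gzip -dc "$file"    # compressed: decode to stdout
    else
        cat "$file"         # already plain: pass through unchanged
    fi
}
```

For example, after saving a response with curl -A "USER_AGENT" -o raw.bin "URL_YOU_NEED_TO_GET", running maybe_gunzip raw.bin > out.html works whether or not the server compressed it.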
