I am trying to write a script that retrieves the HTML from my school's schedule search page. The page loads normally in a browser, but when I request it with cURL, I get the HTML of the page it redirects to instead. When I changed the CURLOPT_FOLLOWLOCATION option from true to false, the output was just the response headers followed by an empty body.

For reference, my PHP code is

<?php
$curl_connection = curl_init('https://www.registrar.usf.edu/ssearch/');
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($curl_connection, CURLOPT_HEADER, true);
curl_setopt($curl_connection, CURLOPT_REFERER, "https://www.registrar.usf.edu/");
$result = curl_exec($curl_connection);
print $result;
?>

The page whose HTML I am trying to retrieve with cURL is https://www.registrar.usf.edu/ssearch/ or https://www.registrar.usf.edu/ssearch/search.php

Any ideas?

  • The page is pushing a couple of cookies: cookie_test=cookie_set; PHPSESSID=nijdlbfqe2dfqqege40eh7lai4 Commented May 9, 2012 at 6:30
  • Should I get cURL to accept cookies and see if that works? Commented May 9, 2012 at 6:32
  • @cacidol - Yes, you should. Added an answer already. Commented May 9, 2012 at 6:42

1 Answer

I added two more lines, which save the cookies and send them back. The site uses those cookies to decide whether to redirect you when you try to scrape the schedule page.

$curl_connection = curl_init();
$url = "https://www.registrar.usf.edu/ssearch/search.php";
curl_setopt($curl_connection, CURLOPT_URL, $url);
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_COOKIEJAR, 'cookie.txt'); // cookiejar to dump cookie infos.
curl_setopt($curl_connection, CURLOPT_COOKIEFILE, 'cookie.txt'); // cookie file for further reference from the site
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_connection, CURLOPT_HEADER, true);
curl_setopt($curl_connection, CURLOPT_REFERER, "https://www.registrar.usf.edu/");
$result = curl_exec($curl_connection);
echo $result;

Also, I haven't seen anyone pass the URL to curl_init before, although it does accept one.

Here is the cookie file:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

www.registrar.usf.edu	FALSE	/	FALSE	0	PHPSESSID	eied78t0v1qlqcop0rdk214361
www.registrar.usf.edu	FALSE	/ssearch/	FALSE	1336718465	cookie_test	cookie_set
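To see what the server sets before anything lands in the cookie jar, the raw headers returned when CURLOPT_HEADER is true can be picked apart by hand. The following is only a rough sketch: summarize_headers is a hypothetical helper name, and the sample header text (including the session ID and the Location value) is made up for illustration rather than captured from the real site.

```php
<?php
// Hypothetical helper: pulls the Location target and the Set-Cookie
// name/value pairs out of a raw HTTP header block.
function summarize_headers(string $rawHeaders): array
{
    $summary = ['location' => null, 'cookies' => []];

    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        if (stripos($line, 'Location:') === 0) {
            $summary['location'] = trim(substr($line, strlen('Location:')));
        } elseif (stripos($line, 'Set-Cookie:') === 0) {
            // Keep only "name=value" and drop attributes such as path or expires.
            $pair = explode(';', trim(substr($line, strlen('Set-Cookie:'))), 2)[0];
            [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');
            $summary['cookies'][trim($name)] = trim($value);
        }
    }

    return $summary;
}

// Made-up sample of the kind of header block a cookie-testing redirect might send.
$sample = "HTTP/1.1 302 Found\r\n"
    . "Set-Cookie: cookie_test=cookie_set; path=/ssearch/\r\n"
    . "Set-Cookie: PHPSESSID=abc123; path=/\r\n"
    . "Location: /\r\n"
    . "\r\n";

$summary = summarize_headers($sample);
```

If a response shows both a Set-Cookie header and a Location header, that is a hint that the redirect depends on the cookie coming back on the next request, which is what the COOKIEJAR and COOKIEFILE options take care of.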

If you ever need to debug a cURL request that isn't working, start with var_dump(curl_getinfo($curl_connection)); and then check curl_error($curl_connection);
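As a rough illustration of that debugging routine, the two calls can be wrapped into one small function that reports what cURL saw. This is a sketch under assumptions: debug_fetch is a hypothetical name, and the fields collected are just the ones that seem most useful for a redirect problem like this one.

```php
<?php
// Hypothetical helper: performs a request and returns a short report
// built from curl_error(), curl_errno() and curl_getinfo().
function debug_fetch(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);

    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    $body = curl_exec($ch);

    return [
        'ok'             => $body !== false,
        'error'          => curl_error($ch),
        'errno'          => curl_errno($ch),
        'http_code'      => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'redirect_count' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
        'effective_url'  => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
}
```

Calling var_dump(debug_fetch($url)) should then show at a glance whether the transfer failed outright (ok is false and error is non-empty) or succeeded with a status code or final URL you did not expect.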

2 Comments

Great, that works, thanks! I looked at my cookie file, but it doesn't look like anything was written to it. I guess the website checks that I can accept cookies but doesn't need them for anything useful. Do you think it's weird that it redirects to the main page instead of asking me to accept cookies?
It's more like this: if the site can't read the cookies back from us, it redirects to the home page. So even if we save the cookies, removing CURLOPT_COOKIEFILE, which is where the cookies sent back to the site are read from, will still get us redirected to home. Maybe your uni needs protection from people who hate cookies :D