
First of all, have a look here:

www.zedge.net/txts/4519/ 

This page has many text messages. I want my script to open each message and download it, but I am having a problem.

This is my simple script to open the page:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // must be set before curl_exec()
$contents = curl_exec($ch);
curl_close($ch);
?>

The page downloads fine, but how would I open every text-message page inside it, one by one, and save its content to a text file? I know how to save the contents of a web page to a text file using curl, but in this case there are many different pages inside the page I've downloaded. How do I open them one by one, separately?

I have this idea, but I don't know if it will work:

Download this page:

www.zedge.net/txts/4519 

Look for all the links to text-message pages inside it and save each link to a text file (one per line). Then run another curl session, read the links from the text file one by one, open each one, copy the content from the particular DIV, and save it to a new file.
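A minimal sketch of that two-phase plan in PHP. The XPath expression (`//a[@class="txtLink"]`) is an assumption; inspect the real listing page to find the actual class name of the message links:

```php
<?php
// Phase 1 (sketch): extract the per-message links from the listing HTML.
// The "txtLink" class is a placeholder - adjust it to the real markup.
function extractMessageLinks(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from sloppy real-world HTML
    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@class="txtLink"]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Phase 2 (sketch): fetch each link, reusing one curl handle.
function fetchAll(array $urls): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $pages = [];
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);
        sleep(1); // be polite to the server
    }
    curl_close($ch);
    return $pages;
}
```

Instead of writing the links to an intermediate text file, you can just keep them in the array returned by extractMessageLinks() and loop over it directly.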

2 Answers


The algorithm is pretty straightforward:

  • download www.zedge.net/txts/4519 with curl
  • parse it with DOM (or alternative) for links
  • either store them all in a text file/database or process them on the fly with a "subrequest"

 

// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($contents); // suppress warnings from malformed HTML

// Filter all the links
$xPath = new DOMXPath($dom);
$items = $xPath->query('//a[@class="myLink"]'); // attribute tests need the @ prefix

foreach ($items as $link) {
    $url = $link->getAttribute('href');
    if (strncmp($url, 'http', 4) != 0) {
        // Prepend http:// or something
    }
    // Open sub request for the extracted link
    curl_setopt($ch, CURLOPT_URL, $url);
    $subContent = curl_exec($ch);
}

See the documentation and examples for DOMXPath::query; note that DOMNodeList implements Traversable, and therefore you can use foreach.
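A tiny self-contained illustration (with made-up markup) of that point: the result of DOMXPath::query is a DOMNodeList, which iterates directly with foreach:

```php
<?php
// Query a small HTML fragment and walk the resulting DOMNodeList.
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>a</li><li>b</li><li>c</li></ul>');
$xpath = new DOMXPath($dom);

$texts = [];
foreach ($xpath->query('//li') as $li) { // DOMNodeList is Traversable
    $texts[] = $li->nodeValue;
}
// $texts now holds the text of each <li>: 'a', 'b', 'c'
```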

Tips:

  • Use the curl options CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE to persist cookies
  • Use sleep(...) so you don't flood the server
  • Raise PHP's time and memory limits (set_time_limit(), the memory_limit ini setting)
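The tips above might look like this in practice; the cookie-file path and limit values are arbitrary placeholders:

```php
<?php
// Long-running scrape: lift PHP's default limits first.
set_time_limit(0);               // no script timeout
ini_set('memory_limit', '256M'); // room for many downloaded pages

// Build a curl handle that persists cookies across requests.
function makeScrapeHandle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies here on close
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // ...and read them back
    return $ch;
}

// Inside the download loop, pause between requests:
// $contents = curl_exec($ch);
// sleep(1);
```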

4 Comments

I don't know much about DOM; when I open the link you've provided for DOM there are infinite things to read :O. BTW, I tried your code and it shows an error: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Empty string supplied as input in E:\Installations\xampp\htdocs\wp\test1.php on line 7
@Xufyan check the content of $contents (var_dump($contents))
I didn't get you; I searched a lot :/ could you please explain?
First you have to have a correct $contents, so check whether $contents is empty or not (i.e. whether the curl request was successful); also, php.net/manual/en/function.curl-getinfo.php with CURLINFO_HTTP_CODE should return 200.

I used DOM for my part of the code. I called my desired page and filtered the data using getElementsByTagName('td'). Here I want the status of my relays from the device page, and I want the updated status of the relays every time. For that I used the code below.

$keywords = array();
$domain = array('http://USERNAME:PASSWORD@URL/index.htm');
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
    @$doc->loadHTMLFile($value);
    //$anchor_tags = $doc->getElementsByTagName('table');
    //$anchor_tags = $doc->getElementsByTagName('tr');
    $anchor_tags = $doc->getElementsByTagName('td');
    foreach ($anchor_tags as $tag) {
        $keywords[] = strtolower($tag->nodeValue);
        //echo $keywords[0];
    }
}

Then I get my desired relay names and statuses in the $keywords[] array.

If you want to read all the messages on the main page, first collect all the links to the separate messages, then apply the same process to each one.

