
I need to make a crawler. For an HTTP request I used to do this:

    var http = require('http');

    var options = {
        host: 'http://www.example.com',
        path: '/foo/example'
    };

    callback = function (response) {
        var str = '';
        response.on('data', function (chunk) {
            str += chunk;
        });
        response.on('end', function () {
            console.log(str);
        });
    };

    http.request(options, callback).end();

but I have to make a crawler for https://example.com/foo/example. If I use the same code for https://example.com/foo/example, it gives this error:

    events.js:72
            throw er; // Unhandled 'error' event
                  ^
    Error: getaddrinfo ENOTFOUND
        at errnoException (dns.js:37:11)
        at Object.onanswer [as oncomplete] (dns.js:124:16)

1 Answer


I'd recommend this excellent HTTP Request module: http://unirest.io/nodejs.html

You can install it with:

npm install unirest

Here's some example Node code with Unirest:

    var unirest = require('unirest');

    var url = 'https://somewhere.com/';

    unirest.get(url)
        .end(function (response) {
            var body = response.body;
            // TODO: parse the body
            done(); // done() is assumed to be defined by the caller
        });

...so to get the HTML at www.purple.com you'd do this:

    #!/usr/bin/env node

    function getHTML(url, next) {
        var unirest = require('unirest');
        unirest.get(url)
            .end(function (response) {
                var body = response.body;
                if (next) next(body);
            });
    }

    getHTML('http://purple.com/', function (html) {
        console.log(html);
    });

3 Comments

The data (elements and their attributes) that I am getting is not the same as what is visible in Inspect Element. It is totally different, or you could say encoded.
unirest.get(url) will get the text data at a URL. When you inspect the page, you're looking at it after JavaScript has run over it, so you're not seeing the raw HTML; you're seeing the DOM after JavaScript modifications.
So can you tell me the way to see the raw HTML? Thanks.
