Universal-Web-Scraper

Scrapes data from any website provided with a template target for it to fetch. This project uses the Java libraries JSoup (http://jsoup.org/download), JSON.Simple (https://code.google.com/p/json-simple/), JSON in Java (http://www.json.org/java/), Selenium (http://www.seleniumhq.org/), PhantomJS (http://phantomjs.org/) and GhostDriver (https://github.com/detro/ghostdriver).

Example code:

String url = "http://www.reddit.com/"; String targets = "{'title':'head > title', 'content':'.thing'}"; // These css selectors point to which elements to scrape WebScraper result = new WebScraper(url, Method.GET, targets); // Scrape the data result.print(); // Print 100% of data Element element = result.elem("content", 2); // Select 'content' and get 3rd element System.out.println(element.text()); // Output the text System.out.println(element.className()); // Output the class name

And if you're feeling like a one liner:

Elements[] result = new WebScraper("www.reddit.com", Method.GET, "{'title':'head > title', 'content':'.thing'}").allElems();

It also handles authentication:

String urlLogin = "https://www.reddit.com/api/login/yourusername/"; // This url is the actual login page which authenticates and returns the session cookies String urlHome = "http://www.reddit.com/"; // Home url, reddit is being used as an example WebScraper result = new WebScraper( urlLogin, urlHome, Method.GET, "{op : login-main, api_type : json, user : yourusername, passwd : yourpassword}", // These are the headers required for the login process "{'content':'html'}" // Fetch the root element ); System.out.println(result); // Output the result after authentication

If you want to scrape from a more complex website (like facebook) AND also take a screenshot of it, you can do it like this:

String urlLogin = "https://www.facebook.com/login.php?login_attempt=1"; String urlHome = "https://www.facebook.com/"; // Initialize: WebScraper result = new WebScraper( urlLogin, urlHome, true, "{email: youremail, pass: yourpassword, persistent: 1, default_persistent: 1, timezone: -60, locale: pt_PT}", "{lsd, lgndim, lgnrnd, lgnjs, qsstamp}" ); // Set callback: result.setProperty(WebScraper.Props.ENGINE_GET_CALLBACK, new EngineCallback() { public void after_get(PhantomJS ctx) { // Take screenshot of your facebook page ctx.take_screenshot("C:\\Users\\Me\\Desktop\\screenshot.png", true);	} public void before_get(PhantomJS ctx) { } }); // Scrape the whole document: result.scrape(urlHome, Method.GET, "{'html':'html'}") .export("C:\\Users\\Me\\Desktop\\scraped.json", false, true); System.out.println("Done scraping");

If you don't want to enter manually the parameters of the POST request for authentication, you can do it manually, which also means you don't have do write your password in your code:

String urlLogin = "https://www.facebook.com/login.php?login_attempt=1"; String urlHome = "https://www.facebook.com"; WebScraper.manual_auth = true; // Authenticate manually. A firefox window will popup and will wait for you to login on the website WebScraper result = new WebScraper(urlLogin,urlHome, true); result.scrape(urlHome, urlLogin, null, "{'html':'html'}"); result.export("C:\\Users\\Me\\Desktop\\out.json", false, true);

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.settings		.settings
lib		lib
src/com/org		src/com/org
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Universal-Web-Scraper

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Universal-Web-Scraper

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages