Scrapes data from any website, given a template of CSS-selector targets to fetch. This project uses the Java libraries JSoup (http://jsoup.org/download), JSON.Simple (https://code.google.com/p/json-simple/), JSON in Java (http://www.json.org/java/), Selenium (http://www.seleniumhq.org/), PhantomJS (http://phantomjs.org/) and GhostDriver (https://github.com/detro/ghostdriver).
Example code:

    String url = "http://www.reddit.com/";
    // These CSS selectors point to which elements to scrape
    String targets = "{'title':'head > title', 'content':'.thing'}";

    WebScraper result = new WebScraper(url, Method.GET, targets); // Scrape the data
    result.print(); // Print all of the data

    Element element = result.elem("content", 2); // Select 'content' and get the 3rd element
    System.out.println(element.text()); // Output the text
    System.out.println(element.className()); // Output the class name

And if you're feeling like a one-liner:

    Elements[] result = new WebScraper("www.reddit.com", Method.GET, "{'title':'head > title', 'content':'.thing'}").allElems();

It also handles authentication:

    // This URL is the actual login page, which authenticates and returns the session cookies
    String urlLogin = "https://www.reddit.com/api/login/yourusername/";
    // Home URL; reddit is being used as an example
    String urlHome = "http://www.reddit.com/";

    WebScraper result = new WebScraper(
        urlLogin,
        urlHome,
        Method.GET,
        // These are the headers required for the login process
        "{op : login-main, api_type : json, user : yourusername, passwd : yourpassword}",
        "{'content':'html'}" // Fetch the root element
    );
    System.out.println(result); // Output the result after authentication

If you want to scrape a more complex website (like Facebook) and also take a screenshot of it, you can do it like this:

    String urlLogin = "https://www.facebook.com/login.php?login_attempt=1";
    String urlHome = "https://www.facebook.com/";

    // Initialize:
    WebScraper result = new WebScraper(
        urlLogin,
        urlHome,
        true,
        "{email: youremail, pass: yourpassword, persistent: 1, default_persistent: 1, timezone: -60, locale: pt_PT}",
        "{lsd, lgndim, lgnrnd, lgnjs, qsstamp}"
    );

    // Set callback:
    result.setProperty(WebScraper.Props.ENGINE_GET_CALLBACK, new EngineCallback() {
        public void after_get(PhantomJS ctx) {
            // Take a screenshot of your Facebook page
            ctx.take_screenshot("C:\\Users\\Me\\Desktop\\screenshot.png", true);
        }
        public void before_get(PhantomJS ctx) { }
    });

    // Scrape the whole document:
    result.scrape(urlHome, Method.GET, "{'html':'html'}")
          .export("C:\\Users\\Me\\Desktop\\scraped.json", false, true);
    System.out.println("Done scraping");

If you don't want to enter the parameters of the POST request for authentication by hand, you can log in manually instead, which also means you don't have to write your password in your code:

    String urlLogin = "https://www.facebook.com/login.php?login_attempt=1";
    String urlHome = "https://www.facebook.com";

    // Authenticate manually. A Firefox window will pop up and wait for you to log in to the website
    WebScraper.manual_auth = true;

    WebScraper result = new WebScraper(urlLogin, urlHome, true);
    result.scrape(urlHome, urlLogin, null, "{'html':'html'}");
    result.export("C:\\Users\\Me\\Desktop\\out.json", false, true);
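
The target templates in the examples above are single-quoted, JSON-like strings that map a name to a CSS selector. With more than a couple of targets, writing that string by hand gets error-prone. Below is a minimal sketch of a helper that assembles it from a map. The class `TargetTemplate` and its `build` method are hypothetical and not part of this project, and the sketch assumes the scraper accepts exactly the format shown in the examples; it uses only the Java standard library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of this project): builds a target template
// string such as {'title':'head > title', 'content':'.thing'} from a map.
class TargetTemplate {

    // Joins name -> CSS selector pairs into the single-quoted format used above.
    // Entries appear in the map's iteration order.
    static String build(Map<String, String> targets) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, String> entry : targets.entrySet()) {
            requireNoSingleQuote(entry.getKey());
            requireNoSingleQuote(entry.getValue());
            joiner.add("'" + entry.getKey() + "':'" + entry.getValue() + "'");
        }
        return joiner.toString();
    }

    // Assumption: how the scraper handles escaped quotes is unknown, so a single
    // quote inside a name or selector is rejected rather than escaped.
    private static void requireNoSingleQuote(String value) {
        if (value.indexOf('\'') >= 0) {
            throw new IllegalArgumentException("Single quote not supported: " + value);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is predictable.
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("title", "head > title");
        targets.put("content", ".thing");
        System.out.println(build(targets)); // prints {'title':'head > title', 'content':'.thing'}
    }
}
```

The resulting string could then be passed wherever the examples use a literal template, for instance as the third argument in `new WebScraper(url, Method.GET, TargetTemplate.build(targets))`. Selectors that need quoting, such as attribute selectors, can use double quotes inside the selector instead of single quotes.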