The document explains web scraping as a method for extracting large volumes of data from websites into local files, emphasizing its utility for various applications. It details the three main steps of web scraping: getting content, parsing the response, and preserving the data, while outlining tools and libraries available like BeautifulSoup and Scrapy. Additionally, it addresses challenges, ethical considerations, and offers examples of practical applications, stressing the importance of conforming to a site's terms of use.
What is web scraping?
Web scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer. The data can be used for several purposes, such as displaying it on your own website or application, performing data analysis, or anything else.
Why should you scrape?
- An API may not provide what you need
- No API rate limits to deal with
- Take exactly what you really want!
- Reduces manual effort
- Swag!
How is it done? Broadly, a three-step process:
1. Getting the content (in most cases, HTML)
2. Parsing the response
3. Optimizing performance and preserving the data
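The three steps can be sketched end to end in a few lines. This is a minimal, self-contained illustration using only the standard library: the HTML is an inline placeholder rather than a real HTTP response, and the regex stands in for a proper parser.

```python
import csv
import io
import re

# Step 1: getting the content — here a canned string instead of an HTTP GET.
html = '<a href="/p/1">First</a><a href="/p/2">Second</a>'

# Step 2: parsing the response — a naive regex; a parsing library
# (BeautifulSoup, lxml) would be more robust on real-world markup.
rows = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# Step 3: preserving the data — serialize to CSV. An in-memory buffer is
# used here; a real scraper would write to a file or a database.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "title"])
writer.writerows(rows)
print(buf.getvalue())
```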
Getting the content
● Using modules like urllib, urllib2, requests, mechanize, and selenium.
● Involves a GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
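With the standard-library urllib mentioned above, building a GET request looks roughly like this. The URL, query parameter, and User-Agent string are hypothetical placeholders; the request object is only constructed here, not sent, so the sketch stays offline (requests offers a friendlier API for actually sending it).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode query-string parameters safely (spaces, special characters, etc.).
params = urlencode({"q": "web scraping"})

# Build the request object; identifying your scraper via User-Agent is
# polite and some servers reject requests without one.
req = Request(
    f"https://example.com/search?{params}",
    headers={"User-Agent": "my-scraper/0.1"},
    method="GET",
)

print(req.get_method())  # GET
print(req.full_url)      # https://example.com/search?q=web+scraping
```

Sending it would be `urllib.request.urlopen(req, timeout=10)`; the returned response body is the content handed to the parsing step.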
Extracting the data
1. Using regular expressions and basic Python
   Tricky, complex, and somewhat fragile.
2. Using parsing libraries
   ❏ Two different approaches are possible: simple parsing and search-tree parsing.
   ❏ Some popular libraries are BeautifulSoup, lxml, and html5lib.
   ❏ Each module has its own techniques, and thus its own pros and trade-offs.
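As a hedged sketch of the parsing-library approach, here is a link extractor built on the standard library's html.parser (BeautifulSoup and lxml offer far friendlier APIs, but html.parser needs no installation). The markup fed in is an inline example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.links = []            # collected (href, text) pairs
        self._current_href = None  # href of the <a> tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))
            self._current_href = None

parser = LinkExtractor()
parser.feed('<li><a href="/p/1">First post</a></li>'
            '<li><a href="/p/2">Second post</a></li>')
print(parser.links)  # [('/p/1', 'First post'), ('/p/2', 'Second post')]
```

Unlike a regex, the parser tracks tag structure, so reordered attributes or extra nesting do not break the extraction.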
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup and Python's requests module (Code)
Example 2: Scraping top Stack Overflow posts using Scrapy (Code)
Example 3: Using Selenium to log in and fetch library details from a university library site that uses dynamic HTML (Code)
What to use where
1. Handling dynamically generated HTML
   Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
   Solution: the requests module
3. Simple scraping
   Solutions: BeautifulSoup + requests, Scrapy, Selenium
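For the cookie-based authentication case, the key feature of requests is the Session object, which persists cookies across calls. A hedged sketch (the login URL and form fields are hypothetical; to stay offline, the session cookie is set manually here instead of arriving via a real Set-Cookie header):

```python
import requests

# A Session persists cookies across requests: after a login POST, the
# server's session cookie is replayed automatically on every later request.
session = requests.Session()

# In a real scraper this would be something like:
#   session.post("https://example.com/login",
#                data={"username": "...", "password": "..."})
# after which the Set-Cookie response header lands in session.cookies.
# Here we set a cookie by hand just to show the persistence mechanism.
session.cookies.set("sessionid", "abc123", domain="example.com")

# Any subsequent session.get(...) to that domain would now carry the cookie.
print(session.cookies.get("sessionid"))  # abc123
```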
Scraping hacks
1. Overcoming CAPTCHAs
   Lookup tables, one-time manual entry, Death By Captcha (a paid service)
2. Per-IP-address query limits
   Using tsocks, ssh -D, and SOCKS proxies
3. Improving performance
   Multiprocessing, gevent, and requests' async support
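The "improving performance" hack boils down to fetching many pages concurrently instead of one at a time. A hedged sketch with the standard library's concurrent.futures; fake_fetch is a placeholder standing in for a real network call such as requests.get:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url: str) -> str:
    """Placeholder for a real HTTP fetch; returns canned content."""
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Threads suit real fetches because they are I/O-bound: while one thread
# waits on the network, others run. For CPU-bound parsing work,
# multiprocessing.Pool is the analogous tool.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fake_fetch, urls))

print(len(pages))  # 5
```

gevent achieves the same overlap with cooperative greenlets rather than OS threads; which to pick is mostly a matter of ecosystem fit.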
Example 3: Automating my college library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution: Selenium with a headless browser like PhantomJS
Alternative: Mechanize
Code
Ethics of scraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use (see the CFAA, breach of contract, unjust enrichment, trespass to chattels, and various state laws similar to the CFAA).
Copyright issues
If the material you are scraping is not purely factual, but required some amount of creativity to create, you have copyright to worry about.
Quick tip: conform to the site's robots.txt file.
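The robots.txt tip can be automated with the standard library's urllib.robotparser. Normally you would call set_url() and read() to fetch the live file; here an inline example file is parsed so the sketch stays offline, and the rules and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt: everything under /private/ is off-limits to all agents.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check each URL before fetching it.
print(rp.can_fetch("my-scraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```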
● The brute-force way to get the information you need.
● Legal in many cases, but mind the site's terms of use.
● Not always as easy as it looks.