Scraping to the rescue (Web scraping using Python)
By: Satwik Kansal and Pradhvan Bisht
What is web scraping?
Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer. The data can then be displayed on your own website or application, used for data analysis, or put to any other purpose.
Why should you scrape?
- The API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!
Things that might come in handy
- HTML
- CSS
- XPath
- Regular expressions
How it's done
Broadly a three-step process:
1. Getting the content (in most cases HTML).
2. Parsing the response.
3. Optimizing performance and preserving the data.
Getting the content
● Using modules like urllib, urllib2, requests, mechanize and selenium.
● Involves a GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
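As a rough illustration of the GET/POST step, here is a minimal sketch using only the standard library's urllib (Python 3 naming; urllib2 is its Python 2 ancestor). The helper names `build_request` and `fetch`, the `USER_AGENT` string, and the example.com URLs are assumptions made for this example, not part of the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical identifier; many sites expect some User-Agent header.
USER_AGENT = "example-scraper/0.1"


def build_request(url, params=None, data=None):
    """Build a GET request, or a POST request when form data is given."""
    if params:
        # Query-string parameters are appended to the URL.
        url = url + "?" + urlencode(params)
    # urllib sends a POST whenever a request body is present.
    body = urlencode(data).encode("utf-8") if data else None
    return Request(url, data=body, headers={"User-Agent": USER_AGENT})


def fetch(request, timeout=10):
    """Send the request and return the decoded response body (usually HTML)."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch(build_request("https://example.com/search", params={"q": "python"}))` would perform the actual network round trip; building the request separately keeps that step easy to inspect before anything is sent.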
Extracting the data
1. Using regular expressions and basic Python: tricky, complex and kind of fragile.
2. Using parsing libraries
❏ Two different approaches are possible: simple parsing and search-tree parsing.
❏ Some popular libraries are BeautifulSoup, lxml and html5lib.
❏ Each module has its own techniques and thus its own pros and trade-offs.
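The fragility of the regex approach can be seen in a small sketch. The sample markup, the `post` class name and the helper names below are invented for illustration, and the standard library's `html.parser` stands in for BeautifulSoup, lxml or html5lib so that the example has no third-party dependencies.

```python
import re
from html.parser import HTMLParser

# Invented sample: the second link uses single quotes and a different
# attribute order, which is perfectly valid HTML.
SAMPLE_HTML = """
<ul>
  <li><a class="post" href="/q/1">First question</a></li>
  <li><a href='/q/2' class="post">Second question</a></li>
</ul>
"""

# Regex approach: tied to one exact way of writing the tag.
LINK_RE = re.compile(r'<a class="post" href="([^"]+)">')


def links_with_regex(html):
    return LINK_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collects the href of every <a class="post"> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "post":
            self.links.append(attributes.get("href"))


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

On this sample the regex version finds only the first link, while the parser version finds both, because the parser works on parsed attributes rather than on the literal text of the tag.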
Comparing parsers: BeautifulSoup, lxml, Scrapy, html5lib
Preserving the data
1. Writing to a file.
2. Exporting as a CSV or Excel file.
3. Storing in a database.
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup and Python's Requests module (Code)
Example 2: Scraping top Stack Overflow posts using Scrapy (Code)
Example 3: Using Selenium to log in and fetch library details from a university library site which uses dynamic HTML.
What to use where
1. Handling dynamically generated HTML. Solutions: Selenium or SpiderMonkey.
2. Cookie-based authentication. Solution: Requests module.
3. Simple scraping. Solutions: BeautifulSoup + Requests, Scrapy, Selenium.
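For the cookie-based case, the key idea is that one object keeps the cookies set at login and sends them back on later requests. The slide recommends the Requests module, where `requests.Session` plays that role; the sketch below shows the same idea with the standard library instead, so it is a substitute rather than the approach from the talk. The form field names `username` and `password` and the helper names are assumptions.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that remembers cookies, plus the jar holding them."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST the login form; any Set-Cookie headers end up in the jar.

    The field names are hypothetical and must match the target site's form.
    """
    form = urlencode({"username": username, "password": password})
    request = Request(login_url, data=form.encode("utf-8"))
    with opener.open(request, timeout=10) as response:
        return response.status
```

After a successful `login(...)`, further calls to `opener.open(...)` on the same site carry the session cookie automatically, which is what lets the scraper reach pages behind the login.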
Scraping hacks
1. Overcoming captchas: lookup tables, one-time manual entry, Death By Captcha (paid service).
2. Per-IP-address query limit: using tsocks, ssh -D and socks monkey.
3. Improving performance: multiprocessing, gevent and the requests.async module.
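The performance point comes down to fetching several pages at once instead of one after another. The sketch below uses a thread pool from `concurrent.futures`, which is a swapped-in alternative to the multiprocessing and gevent options named on the slide; threads are usually enough here because scraping spends most of its time waiting on the network. `fetch_all` and its parameters are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Apply fetch(url) to every URL concurrently.

    Results come back in the same order as the input URLs. `fetch` is any
    callable that takes a URL and returns the page content.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Keeping `max_workers` small is one simple way to stay polite toward the target server and to avoid tripping per-IP query limits.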
Example 3: Automating my college library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution: Selenium with a headless browser like PhantomJS
Alternative: Mechanize
(Code)
Ethics of scraping
Exceeding authorized use of the site: doing anything that is prohibited in the Terms of Use (see CFAA, breach of contract, unjust enrichment, trespass to chattels, and various state laws similar to CFAA).
Copyright issues: if the material you are scraping is not factual, but something that required some amount of creativity to create, you have copyright to worry about.
Quick tip: conform to the robots.txt file.
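Checking robots.txt can be automated with the standard library's `urllib.robotparser`. In the sketch below the robots.txt content, the `example-scraper` user agent and the example.com URLs are invented; a real scraper would point `RobotFileParser.set_url()` at the site's actual robots.txt and call `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="example-scraper"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed` returns False for a URL under `/private/` and True elsewhere, and `parser.crawl_delay("example-scraper")` reports the requested delay of 5 seconds between requests.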
● The brute-force way to get the information required.
● Absolutely legal.
● Not always that easy.

Getting started with Web Scraping in Python
