The document explains web scraping as a method for extracting large volumes of data from websites into local files, emphasizing its utility for various applications. It details the three main steps of web scraping: getting content, parsing the response, and preserving the data, while outlining tools and libraries available like BeautifulSoup and Scrapy. Additionally, it addresses challenges, ethical considerations, and offers examples of practical applications, stressing the importance of conforming to a site's terms of use.
What is web scraping?
Web scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer. The data can be used for several purposes, such as displaying it on your own website or application, performing data analysis, or anything else.
Why should you scrape?
- An API may not provide what you need
- No API rate limits to deal with
- Take exactly what you really want!
- Reduces manual effort
- Swag!
How is it done? Broadly, a three-step process:
1. Getting the content (in most cases, HTML)
2. Parsing the response
3. Optimizing performance and preserving the data
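The three steps can be sketched end to end in a few lines. This is a minimal, self-contained illustration using only the standard library: the HTML is an inline placeholder rather than a real HTTP response, and the regex stands in for a proper parser.

```python
import csv
import io
import re

# Step 1: getting the content — here a canned string instead of an HTTP GET.
html = '<a href="/p/1">First</a><a href="/p/2">Second</a>'

# Step 2: parsing the response — a naive regex; a parsing library
# (BeautifulSoup, lxml) would be more robust on real-world markup.
rows = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# Step 3: preserving the data — serialize to CSV. An in-memory buffer is
# used here; a real scraper would write to a file or a database.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "title"])
writer.writerows(rows)
print(buf.getvalue())
```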
Getting the content
● Using modules like urllib, urllib2, requests, mechanize, and selenium.
● Involves a GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
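With the standard-library urllib mentioned above, building a GET request looks roughly like this. The URL, query parameter, and User-Agent string are hypothetical placeholders; the request object is only constructed here, not sent, so the sketch stays offline (requests offers a friendlier API for actually sending it).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode query-string parameters safely (spaces, special characters, etc.).
params = urlencode({"q": "web scraping"})

# Build the request object; identifying your scraper via User-Agent is
# polite and some servers reject requests without one.
req = Request(
    f"https://example.com/search?{params}",
    headers={"User-Agent": "my-scraper/0.1"},
    method="GET",
)

print(req.get_method())  # GET
print(req.full_url)      # https://example.com/search?q=web+scraping
```

Sending it would be `urllib.request.urlopen(req, timeout=10)`; the returned response body is the content handed to the parsing step.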
Extracting the data
1. Using regular expressions and basic Python
   Tricky, complex, and somewhat fragile.
2. Using parsing libraries
   ❏ Two different approaches are possible: simple parsing and search-tree parsing.
   ❏ Some popular libraries are BeautifulSoup, lxml, and html5lib.
   ❏ Each module has its own techniques, and thus its own pros and trade-offs.
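As a hedged sketch of the parsing-library approach, here is a link extractor built on the standard library's html.parser (BeautifulSoup and lxml offer far friendlier APIs, but html.parser needs no installation). The markup fed in is an inline example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.links = []            # collected (href, text) pairs
        self._current_href = None  # href of the <a> tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))
            self._current_href = None

parser = LinkExtractor()
parser.feed('<li><a href="/p/1">First post</a></li>'
            '<li><a href="/p/2">Second post</a></li>')
print(parser.links)  # [('/p/1', 'First post'), ('/p/2', 'Second post')]
```

Unlike a regex, the parser tracks tag structure, so reordered attributes or extra nesting do not break the extraction.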
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup and Python's requests module (Code)
Example 2: Scraping top Stack Overflow posts using Scrapy (Code)
Example 3: Using Selenium to log in and fetch library details from a university library site that uses dynamic HTML (Code)
What to use where
1. Handling dynamically generated HTML
   Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
   Solution: the requests module
3. Simple scraping
   Solutions: BeautifulSoup + requests, Scrapy, Selenium
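For the cookie-based authentication case, the key feature of requests is the Session object, which persists cookies across calls. A hedged sketch (the login URL and form fields are hypothetical; to stay offline, the session cookie is set manually here instead of arriving via a real Set-Cookie header):

```python
import requests

# A Session persists cookies across requests: after a login POST, the
# server's session cookie is replayed automatically on every later request.
session = requests.Session()

# In a real scraper this would be something like:
#   session.post("https://example.com/login",
#                data={"username": "...", "password": "..."})
# after which the Set-Cookie response header lands in session.cookies.
# Here we set a cookie by hand just to show the persistence mechanism.
session.cookies.set("sessionid", "abc123", domain="example.com")

# Any subsequent session.get(...) to that domain would now carry the cookie.
print(session.cookies.get("sessionid"))  # abc123
```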
Scraping hacks
1. Overcoming CAPTCHAs
   Lookup tables, one-time manual entry, Death By Captcha (a paid service)
2. Per-IP-address query limits
   Using tsocks, ssh -D, and SOCKS proxies
3. Improving performance
   Multiprocessing, gevent, and requests' async support
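The "improving performance" hack boils down to fetching many pages concurrently instead of one at a time. A hedged sketch with the standard library's concurrent.futures; fake_fetch is a placeholder standing in for a real network call such as requests.get:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url: str) -> str:
    """Placeholder for a real HTTP fetch; returns canned content."""
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Threads suit real fetches because they are I/O-bound: while one thread
# waits on the network, others run. For CPU-bound parsing work,
# multiprocessing.Pool is the analogous tool.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fake_fetch, urls))

print(len(pages))  # 5
```

gevent achieves the same overlap with cooperative greenlets rather than OS threads; which to pick is mostly a matter of ecosystem fit.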
Example 3: Automating my college library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution: Selenium with a headless browser like PhantomJS
Alternative: Mechanize
Code
Ethics of scraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use (see the CFAA, breach of contract, unjust enrichment, trespass to chattels, and various state laws similar to the CFAA).
Copyright issues
If the material you are scraping is not purely factual, but required some amount of creativity to create, you have copyright to worry about.
Quick tip: conform to the site's robots.txt file.
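The robots.txt tip can be automated with the standard library's urllib.robotparser. Normally you would call set_url() and read() to fetch the live file; here an inline example file is parsed so the sketch stays offline, and the rules and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt: everything under /private/ is off-limits to all agents.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check each URL before fetching it.
print(rp.can_fetch("my-scraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```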
● The brute-force way to get the information you need.
● Legal in many cases, but mind the site's terms of use.
● Not always as easy as it looks.