Web Scraping in Python with Scrapy Kota Kato @orangain 2015-09-08, 鮨会
Who am I? • Kota Kato • @orangain • Software Engineer • Interested in automation tools such as Jenkins, Chef, and Docker.
Definition: Web Scraping • Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Web scraping - Wikipedia, the free encyclopedia
 https://en.wikipedia.org/wiki/Web_scraping
eBook-1 • Cross-store search engine for ebooks. • Retrieve ebook data from 9 ebook stores. http://ebook-1.com/
QB Meter • Visualize the crowdedness of QB HOUSE, a 10-minute barbershop. • Retrieve crowdedness data from QB HOUSE's web site every 5 minutes. http://qbmeter.capybala.com/
Prototype of Glance • Prototype of a simple news app, like a newspaper. • Retrieve news from NHK NEWS WEB 4 times per day.
Pokedos • Web app to find the nearest bus stops and see bus arrival information. • Retrieve the locations of all bus stops in Kyoto City. http://bus.capybala.com/
Why Web Scraping? • For Web Developers: • Develop mash-up applications. • For Data Analysts: • Retrieve data to analyze. • For Everybody: • Automate operations on web sites.
Why Use Python? • Easy to use • Powerful libraries, especially Scrapy • Seamlessness between data processing and application development
Web Scraping in Python • Combination of lightweight libraries: • Retrieving: Requests • Scraping: lxml, Beautiful Soup • Full stack framework: • Scrapy Today's topic
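The lightweight-library route can be sketched in a few lines: fetch with Requests, scrape with lxml. To keep this self-contained the HTML is stubbed as a string here; in practice it would come from requests.get(url).content. The URL paths and link texts are illustrative.

```python
# Sketch of the "lightweight libraries" approach. In real code the HTML
# would come from Requests, e.g. html = requests.get(url).content.
import lxml.html

html = '''
<ul>
  <li><a href="/posts/1">First post</a></li>
  <li><a href="/posts/2">Second post</a></li>
</ul>
'''

root = lxml.html.fromstring(html)
# XPath extracts each link's href and text.
for href, text in zip(root.xpath('//a/@href'), root.xpath('//a/text()')):
    print(href, text)
# → /posts/1 First post
#   /posts/2 Second post
```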
Scrapy
Scrapy • Fast, simple, and extensible web scraping framework in Python • Currently compatible only with Python 2.7 • Python 3 support in progress • Maintained by Scrapinghub • BSD License http://scrapy.org/
Why Use Scrapy? • Scrapy takes care of the annoying parts of crawling and scraping: extracting links, throttling, concurrency, robots.txt and <meta> tags, XML sitemaps, filtering duplicated URLs, retrying on errors, and job control.
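Each of these behaviors is controlled through settings; a hedged sketch of the relevant knobs in settings.py (the setting names come from Scrapy's settings reference, but the values here are illustrative, not recommendations):

```python
# Illustrative settings.py fragment mapping to the features above.
ROBOTSTXT_OBEY = True                  # honor robots.txt
DOWNLOAD_DELAY = 1                     # throttling: delay between requests (seconds)
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # concurrency limit per domain
RETRY_ENABLED = True                   # retry on error...
RETRY_TIMES = 2                        # ...up to this many times
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # filter duplicated URLs
JOBDIR = 'crawls/sushi-1'              # job control: persist state to pause/resume
```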
Getting Started with Scrapy

$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py

Requirements: Python 2.7, libxml2 and libxslt
http://scrapy.org/
Let's Collect Sushi Images
Create a Scrapy Project

$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
├── scrapy.cfg
└── sushibot
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
Generate a Spider

$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com"]
    start_urls = (
        'http://www.api.flickr.com/',
    )

    def parse(self, response):
        pass
Flickr API to Search Photos

$ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
  <photos page="1" pages="871" perpage="100" total="87088">
    <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
    <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" />
    ...

https://www.flickr.com/services/api/flickr.photos.search.html
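Before wiring this into a spider, the saved response can be inspected with the standard library's ElementTree (a standalone sketch; the XML below is abbreviated from the response above):

```python
# Sketch: pull photo attributes out of a Flickr API response with
# xml.etree (stdlib), mirroring what the spider will do with selectors.
import xml.etree.ElementTree as ET

xml_text = '''<rsp stat="ok">
  <photos page="1" pages="871" perpage="100" total="87088">
    <photo id="4794344495" owner="38553162@N00" secret="d907790937"
           server="4093" farm="5" title="Sushi!" />
    <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb"
           server="8382" farm="9" title="Best Salmon Sushi" />
  </photos>
</rsp>'''

root = ET.fromstring(xml_text)
for photo in root.iter('photo'):  # every <photo> element, at any depth
    print(photo.get('id'), photo.get('title'))
# → 4794344495 Sushi!
#   8486536177 Best Salmon Sushi
```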
Construct a Photo's URL

Photo element:
<photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />

Photo URL template:
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg

Result:
https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg

https://www.flickr.com/services/api/misc.urls.html
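The template can be sanity-checked outside Scrapy with a plain function (a standalone sketch; the spider version builds the URL from selector attributes instead):

```python
# Sketch: fill in Flickr's photo URL template from <photo> attributes.
# The size code 'b' picks the large variant from the [mstzb] choices.
def photo_url(farm, server, photo_id, secret, size='b'):
    return ('https://farm{farm}.staticflickr.com/'
            '{server}/{id}_{secret}_{size}.jpg').format(
        farm=farm, server=server, id=photo_id, secret=secret, size=size)

print(photo_url('5', '4093', '4794344495', 'd907790937'))
# → https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg
```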
spider/sushi.py (Modified)

# -*- coding: utf-8 -*-
import os

import scrapy

from sushibot.items import SushibotItem


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com", "staticflickr.com"]
    start_urls = (
        'https://api.flickr.com/services/rest/?method=flickr.photos.search'
        '&api_key=' + os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
    )

    def parse(self, response):
        for photo in response.css('photo'):
            yield scrapy.Request(photo_url(photo), self.handle_image)

    def handle_image(self, response):
        return SushibotItem(url=response.url, body=response.body)


def photo_url(photo):
    return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
        farm=photo.xpath('@farm').extract_first(),
        server=photo.xpath('@server').extract_first(),
        id=photo.xpath('@id').extract_first(),
        secret=photo.xpath('@secret').extract_first(),
        size='b',
    )
Scrapy's Architecture http://doc.scrapy.org/en/1.0/topics/architecture.html
items.py

# -*- coding: utf-8 -*-
from pprint import pformat

import scrapy


class SushibotItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()

    def __str__(self):
        return pformat({
            'url': self['url'],
            'body': self['body'][:10] + '...',
        })
pipelines.py

# -*- coding: utf-8 -*-
import os


class SaveImagePipeline(object):

    def process_item(self, item, spider):
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item
settings.py • Appended settings:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}
Run Spider

$ FLICKR_KEY=********** scrapy crawl sushi

NOTE: Provide Flickr's API key via an environment variable.
Thank you! • Web scraping has the power to propose improvements. • Source code is available at
 https://github.com/orangain/sushibot @orangain
