I want to scrape a site. There are about 8000 items to scrape. The problem is that if each request takes 1 second, the whole job takes about 8000 seconds, which is roughly 133 minutes, or about 2.2 hours. Can anyone help me figure out how to make multiple requests at the same time? I am using Python's urllib2 to request the contents.
- If you do that, you'll likely get banned from the site you're trying to scrape. Did you read their Terms of Use? Is it OK with them if you scrape their site? – Robert Harvey, Feb 18, 2014 at 17:28
- Yes, they allow scraping. I just need the answer for my scenario. – user3324557, Feb 18, 2014 at 17:38
- Look into Python scraping tools like Beautiful Soup or Scrapy. I know Scrapy can create multiple spiders and launch them to scrape URLs at the same time (12 spiders at once by default). – Ryan G, Feb 18, 2014 at 17:52
2 Answers
Use a better HTTP client. urllib2 makes requests with `Connection: close`, so a new TCP connection has to be negotiated for every request. With requests, you can reuse the TCP connection:

```python
import requests

s = requests.Session()
r = s.get("http://example.org")
```

Make requests in parallel. Since this workload is I/O-bound, the GIL is not a problem and you can use threads. You can run a few simple threads that each download a batch of URLs and then wait for all of them to finish. But something like a "parallel map" may fit this better - I found this answer with a simple example:
https://stackoverflow.com/a/3332884/196206
If you share anything between threads, make sure it is thread-safe. The requests session object seems to be thread-safe: https://stackoverflow.com/a/20457621/196206
Update - a small example:
```python
#!/usr/bin/env python
import lxml.html
import requests
import multiprocessing.dummy
import threading

first_url = "http://api.stackexchange.com/2.2/questions?pagesize=10&order=desc&sort=activity&site=stackoverflow"

rs = requests.session()
r = rs.get(first_url)
links = [item["link"] for item in r.json()["items"]]

lock = threading.Lock()

def f(data):
    n, link = data
    r = rs.get(link)
    doc = lxml.html.document_fromstring(r.content)
    names = [el.text for el in doc.xpath("//div[@class='user-details']/a")]
    with lock:
        print("%s. %s" % (n+1, link))
        print(", ".join(names))
        print("---")
    # you can also return a value; results are returned
    # from pool.map() in the order corresponding to the links
    return (link, names)

pool = multiprocessing.dummy.Pool(5)
names_list = pool.map(f, enumerate(links))
print(names_list)
```

2 Comments
- I moved the lxml processing, together with the downloading, into the function called by the parallel map. You can also write to a file there, but use a lock (docs.python.org/2/library/threading.html#lock-objects) to prevent parallel file writes.

You should consider using Scrapy instead of working directly with lxml and urllib2. Scrapy is "a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages." It's built on top of Twisted, so it is inherently asynchronous, and as a result it is very fast.
I can't give you specific numbers on how much faster your scraping will go, but imagine your requests happening in parallel instead of serially. You'll still need to write the code that extracts the information you want, using XPath or Beautiful Soup, but you won't have to work out the fetching of pages yourself.
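To get a feel for the speedup from parallel requests without hitting a real site, here is a standard-library sketch (not Scrapy itself) using `concurrent.futures.ThreadPoolExecutor`. The `fetch` function and URLs are hypothetical; `time.sleep` stands in for network latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP request; each "request" takes ~0.1 s.
    time.sleep(0.1)
    return "body of %s" % url

urls = ["http://example.org/item/%d" % i for i in range(20)]

start = time.time()
# 10 worker threads fetch the 20 URLs concurrently; pool.map
# returns results in the same order as the input URLs.
with ThreadPoolExecutor(max_workers=10) as pool:
    bodies = list(pool.map(fetch, urls))
elapsed = time.time() - start

# Serially this would take ~2 s; with 10 threads it finishes
# in roughly two batches of ~0.1 s each.
print("fetched %d pages in %.2f s" % (len(bodies), elapsed))
```

Scaled to the question's 8000 items, the same idea turns an 8000-second serial job into something bounded by your worker count and the site's rate limits rather than by one-request-at-a-time latency.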