I want to scrape a site. There are about 8000 items to scrape. The problem is that if each request takes 1 second, the whole job takes about 8000 seconds, which is roughly 133 minutes, or about 2.2 hours. Can anyone help me figure out how to make multiple requests at the same time? I am using Python's urllib2 to request the contents.
- If you do that, you'll likely get banned from the site you're trying to scrape. Did you read their Terms of Use? Is it OK with them if you scrape their site? – Robert Harvey, Feb 18, 2014 at 17:28
- Yes, they allow scraping. I just need the answer for my scenario. – user3324557, Feb 18, 2014 at 17:38
- Look into Python scraping tools like Beautiful Soup or Scrapy. I know Scrapy can create multiple spiders and launch them to scrape URLs at the same time (12 spiders at once by default). – Ryan G, Feb 18, 2014 at 17:52
2 Answers
Use a better HTTP client. urllib2 makes requests with `Connection: close`, so a new TCP connection has to be negotiated for every request. With requests, you can reuse the TCP connection:

```python
import requests

s = requests.Session()
r = s.get("http://example.org")
```

Make requests in parallel. Since this workload is I/O-bound, the GIL is not a problem and you can use threads. You can run a few simple threads that each download a batch of URLs and then wait for all of them to finish. But something like a "parallel map" may fit this better - I found this answer with a simple example:
https://stackoverflow.com/a/3332884/196206
If you share anything between threads, make sure it is thread-safe. The requests session object seems to be thread-safe: https://stackoverflow.com/a/20457621/196206
Update - a small example:
```python
#!/usr/bin/env python
import lxml.html
import requests
import multiprocessing.dummy
import threading

first_url = "http://api.stackexchange.com/2.2/questions?pagesize=10&order=desc&sort=activity&site=stackoverflow"

rs = requests.session()
r = rs.get(first_url)
links = [item["link"] for item in r.json()["items"]]

lock = threading.Lock()

def f(data):
    n, link = data
    r = rs.get(link)
    doc = lxml.html.document_fromstring(r.content)
    names = [el.text for el in doc.xpath("//div[@class='user-details']/a")]
    with lock:
        print("%s. %s" % (n+1, link))
        print(", ".join(names))
        print("---")
    # you can also return a value; results are returned
    # from pool.map() in the order corresponding to the links
    return (link, names)

pool = multiprocessing.dummy.Pool(5)
names_list = pool.map(f, enumerate(links))
print(names_list)
```

2 Comments
- I moved the lxml processing, together with the downloading, into the function called by the parallel map. You can also write to a file there, but use a lock (docs.python.org/2/library/threading.html#lock-objects) to prevent parallel file writes.

You should consider using Scrapy instead of working directly with lxml and urllib2. Scrapy is "a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages." It's built on top of Twisted, so it is inherently asynchronous, and as a result it is very fast.
I can't give you specific numbers on how much faster your scraping will go, but imagine your requests happening in parallel instead of serially. You'll still need to write the code that extracts the information you want, using XPath or Beautiful Soup, but you won't have to work out the fetching of pages yourself.
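To get a feel for the speedup from parallel requests without hitting a real site, here is a standard-library sketch (not Scrapy itself) using `concurrent.futures.ThreadPoolExecutor`. The `fetch` function and URLs are hypothetical; `time.sleep` stands in for network latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP request; each "request" takes ~0.1 s.
    time.sleep(0.1)
    return "body of %s" % url

urls = ["http://example.org/item/%d" % i for i in range(20)]

start = time.time()
# 10 worker threads fetch the 20 URLs concurrently; pool.map
# returns results in the same order as the input URLs.
with ThreadPoolExecutor(max_workers=10) as pool:
    bodies = list(pool.map(fetch, urls))
elapsed = time.time() - start

# Serially this would take ~2 s; with 10 threads it finishes
# in roughly two batches of ~0.1 s each.
print("fetched %d pages in %.2f s" % (len(bodies), elapsed))
```

Scaled to the question's 8000 items, the same idea turns an 8000-second serial job into something bounded by your worker count and the site's rate limits rather than by one-request-at-a-time latency.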