
Possible Duplicate:
Multiple (asynchronous) connections with urllib2 or other http library?

I am working on a Linux web server that runs Python code to grab real-time data over HTTP from a third-party API. The data is put into a MySQL database. I need to make a lot of queries to a lot of URLs, and I need to do it fast (faster = better). Currently I'm using urllib3 as my HTTP library. What is the best way to go about this? Should I spawn multiple threads (if so, how many?) and have each query a different URL? I would love to hear your thoughts about this - thanks!
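For reference, here is a minimal thread-pool sketch using only the standard library; `fetch_all` and its defaults are illustrative, and `fetch` is a stand-in for whatever urllib3/requests call you actually make:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on a thread pool.

    Returns {url: result} for successes and {url: exception} for
    failures, so one bad URL can't kill the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Since the threads spend most of their time waiting on the network, a few dozen workers is a reasonable starting point; tune `max_workers` against what the remote service tolerates.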

  • There is a new answer that I can't add because this question was closed. The best way to do this today is using requests-futures github.com/ross/requests-futures Commented Jun 22, 2018 at 18:05

3 Answers


If "a lot" really is a lot, then you probably want to use asynchronous I/O, not threads.

requests + gevent = grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
grequests.map(rs)

7 Comments

I want to use this method for sending requests to about 50,000 urls. Is it a good strategy? Also, what about exceptions like timeout etc?
@John Yes, it is. As to exceptions see safe_mode parameter and issue 953
I can't send more than 30 requests using grequests. When I do, I get "Max retries exceeded with url: ..., Too many open files". Is there any way to fix this problem?
Word of warning: grequests seems to be abandoned, and does not have error handling. My personal recommendation is github.com/ross/requests-futures , which is equally fast and, with backports, also works on 2.7.
@droope it doesn't look like grequests is abandoned, and it seems easier to run on python_ver < 3.4. Do you have a link to the backports package you're talking about? This is the most popular package I see: pypi.python.org/pypi/backports.ssl_match_hostname

You should use multithreading as well as pipelining of requests, e.g. search -> details -> save.

The number of threads you can use doesn't depend only on your hardware. How many requests can the service serve? How many concurrent requests does it allow? Even your bandwidth can be a bottleneck.

If you're talking about some kind of scraping, the service could block you after a certain number of requests, so you may need to use proxies or multiple IP bindings.

In my experience, in most cases I can run 50-300 concurrent requests on my laptop from Python scripts.
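A minimal sketch of such a pipeline with standard-library queues; the stage functions here are placeholders (a real `search` would issue the HTTP request and a real `save` would write to MySQL), and the per-stage worker counts are the knobs to experiment with:

```python
import queue
import threading

SENTINEL = object()  # signals a stage to shut down

def stage(in_q, out_q, work, n_workers):
    """Start n_workers threads that apply `work` to items from in_q
    and push results to out_q (if there is a next stage)."""
    def run():
        while True:
            item = in_q.get()
            if item is SENTINEL:
                in_q.put(SENTINEL)  # let sibling workers see it too
                break
            result = work(item)
            if out_q is not None:
                out_q.put(result)
    threads = [threading.Thread(target=run) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

# search -> save, with more threads on the slower stage
search_q, details_q = queue.Queue(), queue.Queue()
saved = []
save_lock = threading.Lock()

def search(term):      # placeholder: would issue the search request
    return term + "-id"

def save(detail):      # placeholder: would INSERT into MySQL
    with save_lock:
        saved.append(detail)

t1 = stage(search_q, details_q, search, n_workers=4)
t2 = stage(details_q, None, save, n_workers=2)

for term in ["a", "b", "c"]:
    search_q.put(term)
search_q.put(SENTINEL)
for t in t1:
    t.join()
details_q.put(SENTINEL)
for t in t2:
    t.join()
```

Because each stage has its own pool, a slow stage (usually the network-bound one) can be given more workers without over-threading the rest.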

3 Comments

Agree with Polscha, here. Most of the time, when you're making HTTP requests to an arbitrary service, most of the (clock) time expended is in waiting for the network and the remote service to respond. So, within reason, the more threads the better, as at any given moment most of those threads will just be in wait queues. Definitely heed Polscha's notes on service throttling.
thanks guys - the service is commercial and we are paying for it. it is very fast and will not be the bottleneck. in this case, what would be the best option?
@user1094786 In this case, just try to build a pipeline of requests and experiment with the number of threads on each stage. Just try; sooner or later you'll find the upper limit :-)

Sounds like an excellent application for Twisted. Here are some web-related examples, including how to download a web page. Here is a related question on database connections with Twisted.

Note that Twisted does not rely on threads for doing multiple things at once. Rather, it takes a cooperative multitasking approach: your main script starts the reactor, and the reactor calls functions that you set up. Your functions must return control to the reactor before the reactor can continue working.
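The same cooperative model is what the standard library's asyncio offers today, so here is a rough sketch of the idea using asyncio rather than Twisted itself; the coroutine names are illustrative, and a real version would await an async HTTP client instead of simulated work:

```python
import asyncio

async def grab(url):
    # Simulate waiting on the network; while this coroutine waits,
    # the event loop (the "reactor") runs the others.
    await asyncio.sleep(0.01)
    return url, "ok"

async def main(urls):
    # Control returns to the event loop at every `await`, so all
    # downloads proceed concurrently on a single thread.
    return await asyncio.gather(*(grab(u) for u in urls))

results = asyncio.run(main(["http://a", "http://b"]))
```

As with Twisted, a function that blocks (e.g. a synchronous MySQL write) stalls the whole loop, which is why Twisted pairs the reactor with non-blocking database adapters.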


