Multi-step, concurrent HTTP requests in Python

Question

I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?

Up until this point, I've been doing one-step scraping with eventlet, like this:

    import eventlet
    from eventlet.green import urllib2

    urls = ['http://example.com', '...']

    def scrape_page(url):
        """Gets the data from the web page."""
        body = urllib2.urlopen(url).read()
        # Do something with body
        data = body  # placeholder for the parsed result
        return data

    pool = eventlet.GreenPool()
    for data in pool.imap(scrape_page, urls):
        pass  # Handle the data...

However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.

I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.

I really do recommend looking at Twisted; it's true the library is huge, but you only need its HTTP client to do this, and having attempted a similar task both ways, I found the high-level library approach much easier.
– Andrew Gorcester, Commented Jul 16, 2012 at 3:44

jdi · Accepted Answer · 2012-07-16 05:44:42Z

Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.

Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.

Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.

Technically, this approach extends naturally to any number of steps, not just three. As long as the workers find new urls and put them in the queue, the work keeps happening.

Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
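To make the single-pool idea concrete, here is a minimal sketch. It is not eventlet code: it swaps in the standard library's `threading` and `queue` modules so it runs anywhere, and it replaces real HTTP requests with a made-up in-memory link table. `FAKE_LINKS`, `find_links`, `worker`, `crawl`, and `MAX_DEPTH` are all hypothetical names chosen for illustration. The depth counter is an assumption added to stop after the three steps mentioned in the question.

```python
import queue
import threading

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_LINKS = {
    "http://example.com/a": ["http://example.com/a1", "http://example.com/a2"],
    "http://example.com/b": ["http://example.com/b1"],
    "http://example.com/a1": ["http://example.com/a1x"],
    "http://example.com/a2": [],
    "http://example.com/b1": ["http://example.com/b1x"],
    "http://example.com/a1x": ["http://example.com/too-deep"],
    "http://example.com/b1x": [],
}

MAX_DEPTH = 3  # base page = step 1, its links = step 2, their links = step 3


def find_links(url):
    """Stand-in for fetching a page and extracting the links of interest."""
    return FAKE_LINKS.get(url, [])


def worker(work_queue, seen, seen_lock, results):
    """Pull (url, depth) items from the queue and feed new urls back into it."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: shut this worker down
            work_queue.task_done()
            return
        url, depth = item
        try:
            results.append((depth, url))
            if depth < MAX_DEPTH:
                for link in find_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    # Queued before task_done(), so join() cannot return early.
                    work_queue.put((link, depth + 1))
        finally:
            work_queue.task_done()


def crawl(start_urls, num_workers=4):
    """Seed the queue with the base urls and let one pool of workers drain it."""
    work_queue = queue.Queue()
    seen = set(start_urls)
    seen_lock = threading.Lock()
    results = []
    for url in start_urls:
        work_queue.put((url, 1))
    threads = [
        threading.Thread(target=worker,
                         args=(work_queue, seen, seen_lock, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    work_queue.join()  # blocks until every queued url has been processed
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)
```

Calling `crawl(["http://example.com/a", "http://example.com/b"])` returns a sorted list of `(depth, url)` pairs. No worker ever waits for a whole "step" to finish: a step-3 url can be processed while step-1 pages are still in flight, which is the behaviour the nested `imap` loop could not give you.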

Post note

Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:

http://eventlet.net/doc/examples.html#producer-consumer-web-crawler

In that example there are two functions, producer and fetch. The producer pulls urls from the queue and spawns green threads to run fetch. fetch then puts any new urls it finds back into the queue, so the two keep feeding each other.
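The producer/fetch split described above looks roughly like the following. This is only a sketch of the structure, not the eventlet example itself: it assumes the standard library's `ThreadPoolExecutor` in place of eventlet's green thread pool, and the `PAGES` table is invented data standing in for real HTTP responses.

```python
import queue
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical link table standing in for real pages (note the cycle).
PAGES = {
    "http://example.com/": ["http://example.com/one", "http://example.com/two"],
    "http://example.com/one": ["http://example.com/two", "http://example.com/three"],
    "http://example.com/two": ["http://example.com/"],
    "http://example.com/three": [],
}


def fetch(url, out_queue):
    """Stand-in fetch: 'download' the page and queue every link found on it."""
    for link in PAGES.get(url, []):
        out_queue.put(link)


def producer(start_url, max_workers=4):
    """Dispatch a fetch for each unseen url until the queue stays empty."""
    seen = set()  # only touched by the producer, so no lock is needed
    url_queue = queue.Queue()
    url_queue.put(start_url)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            pending = []
            # Drain whatever is queued right now and hand it to the pool.
            while not url_queue.empty():
                url = url_queue.get()
                if url not in seen:
                    seen.add(url)
                    pending.append(pool.submit(fetch, url, url_queue))
            # Let the in-flight fetches finish; they may have queued more urls.
            wait(pending)
            if url_queue.empty():
                break
    return seen
```

The difference from a pool of self-feeding workers is that only the producer decides what gets fetched, so the "seen" bookkeeping lives in one place. The cost is that the producer waits for the current batch of fetches before it looks at the queue again.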

Oh, awesome, I don't know how I missed that example. Let me try this out and I will get back to you, but it looks like it does exactly what I need. Thank you!
Forgot to respond, but it worked like a charm! Only problem is eventlet's green thread limit is fairly low on Windows, but that's not that big a problem. Thanks again!
