I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
urls = ['http://example.com', '...'] def scrape_page(url): """Gets the data from the web page.""" body = eventlet.green.urllib2.urlopen(url).read() # Do something with body return data pool = eventlet.GreenPool() for data in pool.imap(screen_scrape, urls): # Handle the data... However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.