I have code that retrieves news results from this newspaper for a given query and time frame (which can be up to a year).
The results are paginated at 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then extract the title, URL and date of each article. Each cycle (the HTTP request plus the parsing) takes 30 seconds to a minute, which is extremely slow, and eventually the server starts returning a 500 response code. I am wondering whether there is a way to speed this up, perhaps by making multiple requests at once. I simply want to retrieve the article details from all the pages. Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})
        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|',
                           quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))
        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
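For reference, the kind of concurrent fetching I have in mind would look roughly like the sketch below (Python 3, using the standard-library `concurrent.futures`). `fetch_page` here is a hypothetical callable standing in for my request-plus-parse step, and `page_count` is assumed to be known in advance (e.g. read from the pager on the first results page):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_pages(fetch_page, page_count, max_workers=5):
    """Fetch pages 1..page_count concurrently, preserving page order.

    fetch_page: hypothetical callable taking a page index and returning
    the parsed article rows for that page (it would wrap the
    requests.get + BeautifulSoup step from the script above).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the pages
        # come back sorted even though they are fetched in parallel.
        return list(pool.map(fetch_page, range(1, page_count + 1)))
```

A small `max_workers` seems prudent here, since hammering the server harder might just bring on the 500 responses sooner.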