I have code that retrieves news results from this newspaper for a given query and time frame (which can be up to a year).
The results are paginated at 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then extract the title, URL and date of each article. Each cycle (the HTTP request plus the parsing) takes 30 seconds to a minute, which is extremely slow, and eventually the server starts returning a 500 response code. I am wondering whether there is a way to speed this up, perhaps by making multiple requests at once. I simply want to retrieve the article details from all the pages. Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})
        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|',
                           quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))
        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
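For reference, the kind of concurrent fetching I have in mind would look roughly like the sketch below (Python 3, using the standard-library `concurrent.futures`). `fetch_page` here is a hypothetical callable standing in for my request-plus-parse step, and `page_count` is assumed to be known in advance (e.g. read from the pager on the first results page):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_pages(fetch_page, page_count, max_workers=5):
    """Fetch pages 1..page_count concurrently, preserving page order.

    fetch_page: hypothetical callable taking a page index and returning
    the parsed article rows for that page (it would wrap the
    requests.get + BeautifulSoup step from the script above).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the pages
        # come back sorted even though they are fetched in parallel.
        return list(pool.map(fetch_page, range(1, page_count + 1)))
```

A small `max_workers` seems prudent here, since hammering the server harder might just bring on the 500 responses sooner.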