
I'm working on a script that autonomously scrapes historical data from several websites and saves it to the same Excel file for each past date within a specified date range. Each function accesses several webpages from a different website, formats the data, and writes it to the file on a separate sheet. Because I am continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?

I want to make one request with Function 1, then one request with Function 2, and so on until every function has made one request. Once they all have, I would like to loop back and make the second request within each function (and so on) until all requests for a given date are complete. Doing this would keep the same amount of sleep time between requests on each website while greatly reducing the total runtime. One thing to note is that each function makes a slightly different number of HTTP requests: for instance, on a given date Function 1 may make 10 requests while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.
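To illustrate the interleaving I have in mind, here is a rough, purely illustrative sketch (the per-site request lists and the request callables are placeholders, not my actual functions):

import random
import time
from itertools import zip_longest

def round_robin(site_request_lists, min_sleep, max_sleep):
    # site_request_lists: one list of pending request callables per website;
    # the lists can have different lengths (e.g. 10, 8, 8, 7, 10)
    for batch in zip_longest(*site_request_lists):
        for make_request in batch:
            if make_request is None:
                continue  # this site has no more requests for this date
            make_request()  # make (and process) one request for this site
        # one pause per round, so each site sees the same spacing
        # between its own consecutive requests
        time.sleep(random.randrange(min_sleep, max_sleep, 1))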

I've read into this topic, including multithreading, but I am unsure how to apply it to my specific scenario. If there is no way to do this, I could run each function as its own script and run them all at the same time, but then I would have to concatenate five different Excel files for each date, which is why I am trying to do it this way.

import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date, end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]

max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))

3 Answers


Using Python 3.6

Here is a minimal example of concurrent requests with aiohttp to get you started (docs). This example runs 3 downloaders at the same time, appending each rsp to responses. I believe you will be able to adapt this idea to your problem.

import asyncio

from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if not rsp.status == 200:
            continue  # <- Or raise error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:
        iter_url = iter(urls)
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html',
]

responses = []
loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))

Result:

>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
 <CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>,
 <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
 <CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>,
 <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
 <CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>]
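If you need to keep your per-site delays with this approach, each downloader could pause between its own requests; a non-blocking await asyncio.sleep lets the other downloaders keep working in the meantime. A small variation on the downloader above (the sleep bounds are placeholders for your own min/max values):

import asyncio
import random


async def polite_downloader(session, iter_url, responses, min_sleep=150, max_sleep=250):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status == 200:
            responses.append(rsp)
        # pause before this site's next request without blocking the other downloaders
        await asyncio.sleep(random.uniform(min_sleep, max_sleep))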


Here is a minimal example demonstrating how to use concurrent.futures for parallel processing. It does not include the actual scraping logic, which you can add yourself if needed, but it demonstrates the pattern to follow:

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def scrape_func(*args, **kwargs):
    """Stub function to use with futures - your scraping logic"""
    print("Do something in parallel")
    return "result scraped"


def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]

    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures
    # - set number of workers as the number of jobs to process
    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []
        for date in date_range:
            # Pass some keyword arguments if needed - per job
            kw = {"some_param": "value"}

            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()

            # Append to the list of results
            scraped_results.append(scraped_result)

        # Iterate over results scraped and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))


if __name__ == "__main__":
    main()

As mentioned, this just demonstrates the pattern to follow; the rest should be straightforward.
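For your particular loop, a closer fit might be one worker per website rather than one per date, so each site keeps its own sleep schedule while the Excel writing stays in the main thread. A rough sketch under the assumption that each of your functions can take the date and return a dict mapping sheet names to DataFrames (that return convention is my assumption, not something in your code):

from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

site_funcs = [Function1, Function2, Function3, Function4, Function5]

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')

    # One worker per website; each function sleeps between its own requests
    with ThreadPoolExecutor(len(site_funcs)) as executor:
        jobs = [executor.submit(func, date) for func in site_funcs]

        # Collect the per-site frames and write them here in the main thread,
        # since ExcelWriter is not guaranteed to be thread-safe
        for job in as_completed(jobs):
            for sheet_name, frame in job.result().items():
                frame.to_excel(writer, sheet_name=sheet_name)

    writer.save()
    print('Date Complete: ' + date)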



Thanks for the responses, guys! As it turns out, a pretty simple block of code from this other question (Make 2 functions run at the same time) seems to do what I want.

from threading import Thread


def func1():
    print('Working')


def func2():
    print('Working')


if __name__ == '__main__':
    Thread(target=func1).start()
    Thread(target=func2).start()
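In my script, the plan is to start one thread per scraping function inside the per-date loop and join them all before saving, so the workbook is still saved once per date. A rough sketch of how I intend to wire it in (my functions currently share the writer through globals, so I may still need to move the actual Excel writing back into the main thread, since ExcelWriter isn't guaranteed to be thread-safe):

from threading import Thread

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')

    # one thread per website scraper; each keeps its own sleep schedule
    threads = [Thread(target=func)
               for func in (Function1, Function2, Function3, Function4, Function5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every site to finish its requests for this date

    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))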

4 Comments

If threading solves your problem, do it. You might revisit these examples if you outgrow your current solution.
Just out of curiosity, what do you mean by outgrow my solution? Is threading a more CPU-intensive process, or is it only suitable for a small number of functions to be efficient, or something? I'm new to this specific concept, so I'm learning as I go.
I can't do that question justice; the answer is moderately nuanced. This talk by Raymond Hettinger might be a good start: youtube.com/watch?v=Bv25Dwe84g0
Thank you for this, I found it quite interesting and informative. I'm going to implement the code with threading and see how it goes.
