I'm working on a script that autonomously scrapes historical data from several websites and saves it to the same Excel file for each past date within a specified date range. Each individual function accesses several webpages on a different website, formats the data, and writes it to the file on separate sheets. Because I'm continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?
I want to make one request with Function 1, then one request with Function 2, and so on until every function has made one request. Once all functions have made a request, I'd like it to loop back and make the second request in each function (and so on) until all requests for a given date are complete. Doing this would keep the same amount of sleep time between requests to each individual website while cutting the total runtime substantially. One thing to note is that each function makes a slightly different number of HTTP requests: for instance, on a given date Function 1 may make 10 requests while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.
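To make the scheduling I'm describing concrete, here's a rough, simplified sketch. The scrape_site generators are placeholders standing in for my real functions (which actually hit the websites and format the data); each one yields after every request, and the main loop takes one request from every site per pass before sleeping once:

    import random
    import time

    min_sleeptime = 150
    max_sleeptime = 250

    # Placeholder: each real scraper would become a generator that makes one
    # HTTP request (and formats the result) per iteration, then yields.
    def scrape_site(name, num_requests):
        for i in range(num_requests):
            # ... request one page, format the data ...
            yield name, i

    # The functions make different numbers of requests (10, 8, 8, 7, 10).
    pending = [
        scrape_site('Site1', 10),
        scrape_site('Site2', 8),
        scrape_site('Site3', 8),
        scrape_site('Site4', 7),
        scrape_site('Site5', 10),
    ]

    # Round robin: one request per site per pass, then one sleep, so each
    # individual website still sees the usual gap between its own requests.
    while pending:
        for gen in list(pending):
            try:
                next(gen)
            except StopIteration:
                pending.remove(gen)
        time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))

That's the behaviour I'm after; I just don't know whether this kind of single-threaded interleaving or something thread-based is the right way to structure it.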
I've read about multithreading, but I'm unsure how to apply it to this specific scenario. If there is no way to do this, I could run each function as its own script and run them all at the same time, but then I'd have to concatenate five different Excel files for each date, which is what I'm trying to avoid. For reference, here is my current code, followed by a sketch of the threaded version I'm imagining:
    import random
    import time

    import pandas as pd

    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]

    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    for date in date_range:
        writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
        # Each function scrapes one website and writes its data to the
        # workbook on its own sheets, sleeping between its requests.
        Function1()
        Function2()
        Function3()
        Function4()
        Function5()
        writer.save()
        print('Date Complete: ' + date)
        time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))
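Based on what I've read, my guess is something along these lines: each function runs in its own thread via concurrent.futures.ThreadPoolExecutor (keeping its own sleeps between requests) and returns its DataFrames, so that all Excel writing happens in the main thread, since I assume writing to the same workbook from several threads at once isn't safe. The scrape_site function, sheet names, request counts, and dates below are placeholders rather than my real scrapers, and I'm not sure this is the right approach:

    import concurrent.futures
    import random
    import time

    import pandas as pd

    min_sleeptime = 150
    max_sleeptime = 250

    # Placeholder standing in for the five real functions: instead of writing
    # to the shared writer, each one returns {sheet_name: DataFrame} and still
    # sleeps between its own requests.
    def scrape_site(name, num_requests, date):
        frames = {}
        for i in range(num_requests):
            # ... request one page for `date` and format it into a DataFrame ...
            frames['%s %d' % (name, i + 1)] = pd.DataFrame({'placeholder': [i + 1]})
            time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))
        return frames

    # One entry per website, with differing request counts (10, 8, 8, 7, 10).
    sites = [('Site1', 10), ('Site2', 8), ('Site3', 8), ('Site4', 7), ('Site5', 10)]

    # Example dates; in the real script this would come from pd.date_range as above.
    date_range = ['2021-01-04', '2021-01-05']

    for date in date_range:
        # Run all five scrapers at the same time, one thread per website.
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(sites)) as executor:
            futures = [executor.submit(scrape_site, name, n, date) for name, n in sites]
            results = [f.result() for f in futures]

        # Write the collected DataFrames from the main thread only.
        with pd.ExcelWriter('Daily Data -' + date + '.xlsx') as writer:
            for frames in results:
                for sheet_name, df in frames.items():
                    df.to_excel(writer, sheet_name=sheet_name)

        print('Date Complete: ' + date)
        time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))

If this is roughly right, the only change to my existing functions would be returning their DataFrames instead of writing them directly, which would also save me from having to merge five separate Excel files afterwards. Is this a reasonable way to do it, or is there a better pattern for this kind of interleaving?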