dmitryro
  • 3.5k
  • 2
  • 23
  • 30

Here is a minimal example demonstrating how to use concurrent.futures for parallel processing. It does not include the actual scraping logic (you can plug in your own), but it shows the pattern to follow:

import pandas as pd

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor


def scrape_func(*args, **kwargs):
    """Stub function to use with futures - your scraping logic."""
    print("Do something in parallel")
    return "result scraped"


def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures
    # - set number of workers as the number of jobs to process
    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []
        for date in date_range:
            # Pass some keyword arguments if needed - per job
            kw = {"some_param": "value"}
            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be a different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

        # Iterate over the scraped results and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))


if __name__ == "__main__":
    main()

As mentioned, this is just to demonstrate the pattern to follow; the rest should be straightforward.
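For reference, here is a self-contained, runnable sketch of the same pattern with the placeholders filled in: it builds the date range with the standard library instead of pandas (so there are no external dependencies), and `scrape_func` simply echoes its date argument. The concrete dates and the function body are illustrative assumptions, not part of the original answer:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import date, timedelta


def scrape_func(day):
    # Placeholder for real scraping work - returns a fake result per date
    return "scraped {}".format(day)


def run():
    # Build a small date range without pandas
    start = date(2023, 1, 1)
    date_range = [(start + timedelta(days=i)).isoformat() for i in range(5)]

    results = []
    # One worker per job, as in the answer above
    with ThreadPoolExecutor(max_workers=len(date_range)) as executor:
        # Map each future back to the date it was submitted for
        jobs = {executor.submit(scrape_func, d): d for d in date_range}
        # as_completed yields futures in completion order, not submission order
        for job in as_completed(jobs):
            results.append(job.result())
    return results


print(sorted(run()))
```

Note that `as_completed` returns results in whatever order the workers finish, so sort (or use the `jobs` dict to recover the originating date) if order matters.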
