I am using the requests module to download the content of many websites, which looks something like this:

```python
import requests

for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
```

Of course this is simplified, but the base URL is always the same; I only want to download an image, for example. This takes very long due to the number of iterations. The internet connection is not the problem; it's the time it takes to start each request. So I was thinking about something like multiprocessing, because this task is basically always the same, and I could imagine it being a lot faster when parallelized.

Is this somehow doable? Thanks in advance!

1 Answer

I would suggest that in this case lightweight threads would be better. When I ran the same request on a certain URL 5 times, the results were:

```
Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)
```

Your implementation can be something like this:

```python
import concurrent.futures
import time

import requests
from bs4 import BeautifulSoup


def access_url(url, no):
    """Fetch a URL and return its page title, tagged with the request number."""
    print(f"{no}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} : {}".format(no, str(soup.title)[7:50])


if __name__ == "__main__":
    test_url = "http://bla bla.com/"
    base_url = test_url

    # Parameters are prepared as lists so executor.map() can be used.
    url_list = [base_url for i in range(5)]
    url_counter = [i for i in range(5)]

    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start the timer
    if THREAD_MULTI_PROCESSING:
        # In this (I/O-bound) case, threads are the better fit.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
        end = time.perf_counter()
        print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    PROCESS_MULTI_PROCESSING = True
    start = time.perf_counter()
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
        end = time.perf_counter()
        print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')
```

I think you will see better performance in your case.
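Applied to the loop from the question, a minimal sketch might look like the following. The `base_url` value and the `.anything` suffix are taken from the question; the placeholder host and the `max_workers` value are illustrative assumptions you should adapt.

```python
import concurrent.futures

import requests

# Placeholder: substitute your real base URL.
base_url = "https://example.com/images/"


def make_url(i):
    """Build the i-th URL, mirroring the f-string from the question."""
    return base_url + f"{i}.anything"


def download(i):
    """Fetch one URL; return the index, HTTP status, and body size."""
    r = requests.get(make_url(i), timeout=10)
    return i, r.status_code, len(r.content)


if __name__ == "__main__":
    # Threads overlap the per-request latency; tune max_workers to taste.
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
        for i, status, size in executor.map(download, range(1000)):
            print(f"{i}: HTTP {status}, {size} bytes")
```

With `executor.map`, results come back in submission order, which keeps the output easy to match against the original loop.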

2 Comments

Thank you very much, I actually chose Threads after all, great example!
You are welcome! I agree, threads seem a good fit here if you have only one processor.
