I am using the requests module to download the content of many websites, which looks something like this:

```python
import requests

for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
```

Of course this is simplified, but the base URL is always the same; I only want to download an image, for example. This takes very long due to the number of iterations. The internet connection is not the problem; it's the time it takes to start each request. So I was thinking about something like multiprocessing, because this task is basically always the same, and I could imagine it being a lot faster when parallelized.

Is this somehow doable? Thanks in advance!

1 Answer

I would suggest that in this case lightweight threads would be better. When I ran the same request on a certain URL 5 times, the results were:

```
Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)
```

Your implementation can be something like this:

```python
import concurrent.futures
import time

import requests
from bs4 import BeautifulSoup


def access_url(url, no):
    """Fetch a URL and return its page title, tagged with the request number."""
    print(f"{no}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} : {}".format(no, str(soup.title)[7:50])


if __name__ == "__main__":
    test_url = "http://bla bla.com/"
    base_url = test_url

    # Parameters are prepared as lists so executor.map() can be used.
    url_list = [base_url for i in range(5)]
    url_counter = [i for i in range(5)]

    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start the timer
    if THREAD_MULTI_PROCESSING:
        # In this (I/O-bound) case, threads are the better fit.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
        end = time.perf_counter()
        print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    PROCESS_MULTI_PROCESSING = True
    start = time.perf_counter()
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
        end = time.perf_counter()
        print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')
```

I think you will see better performance in your case.
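Applied to the loop from the question, a minimal sketch might look like the following. The `base_url` value and the `.anything` suffix are taken from the question; the placeholder host and the `max_workers` value are illustrative assumptions you should adapt.

```python
import concurrent.futures

import requests

# Placeholder: substitute your real base URL.
base_url = "https://example.com/images/"


def make_url(i):
    """Build the i-th URL, mirroring the f-string from the question."""
    return base_url + f"{i}.anything"


def download(i):
    """Fetch one URL; return the index, HTTP status, and body size."""
    r = requests.get(make_url(i), timeout=10)
    return i, r.status_code, len(r.content)


if __name__ == "__main__":
    # Threads overlap the per-request latency; tune max_workers to taste.
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
        for i, status, size in executor.map(download, range(1000)):
            print(f"{i}: HTTP {status}, {size} bytes")
```

With `executor.map`, results come back in submission order, which keeps the output easy to match against the original loop.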

2 Comments

Thank you very much, I actually chose Threads after all, great example!
You are welcome! I agree, threads seem a good fit here if you have only one processor.
