
Below is a program that makes multiple GET requests and writes the response images to my directory. The GET requests are meant to run in separate threads, and thus be quicker than without threads, but I'm not seeing any performance difference.

Printing active_count() shows there are 9 threads created. However, the run still takes around 40 seconds whether or not I use threading.

Below is me using threading.

    from threading import active_count
    import concurrent.futures
    import time

    import requests

    img_urls = [
        'https://images.unsplash.com/photo-1516117172878-fd2c41f4a759',
        'https://images.unsplash.com/photo-1532009324734-20a7a5813719',
        'https://images.unsplash.com/photo-1524429656589-6633a470097c',
        'https://images.unsplash.com/photo-1530224264768-7ff8c1789d79',
        'https://images.unsplash.com/photo-1564135624576-c5c88640f235',
        'https://images.unsplash.com/photo-1541698444083-023c97d3f4b6',
        'https://images.unsplash.com/photo-1522364723953-452d3431c267',
        'https://images.unsplash.com/photo-1513938709626-033611b8cc03',
        'https://images.unsplash.com/photo-1507143550189-fed454f93097',
        'https://images.unsplash.com/photo-1493976040374-85c8e12f0c0e',
        'https://images.unsplash.com/photo-1504198453319-5ce911bafcde',
        'https://images.unsplash.com/photo-1530122037265-a5f1f91d3b99',
        'https://images.unsplash.com/photo-1516972810927-80185027ca84',
        'https://images.unsplash.com/photo-1550439062-609e1531270e',
        'https://images.unsplash.com/photo-1549692520-acc6669e2f0c'
    ]

    t1 = time.perf_counter()

    def download_image(img_url):
        img_bytes = requests.get(img_url).content
        img_name = img_url.split('/')[3]
        img_name = f'{img_name}.jpg'
        with open(img_name, 'wb') as img_file:
            img_file.write(img_bytes)
        print(f'{img_name} was downloaded...')

    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(download_image, img_urls)
        print(active_count())

    t2 = time.perf_counter()
    print(f'Finished in {t2-t1} seconds')

Below is without threading

    def download_image(img_url):
        img_bytes = requests.get(img_url).content
        img_name = img_url.split('/')[3]
        img_name = f'{img_name}.jpg'
        with open(img_name, 'wb') as img_file:
            img_file.write(img_bytes)
        print(f'{img_name} was downloaded...')

    for img_url in img_urls:
        download_image(img_url)

Could someone explain why this is happening? Thanks

  • Could you add the start and end time of each iteration of download_image? I'm pretty sure all of your downloads start at the same time, but each takes much longer. The reason should be network-related. I tried your piece of code and it works; I saw around a 10 s improvement (on a slow network). Commented Apr 20, 2022 at 8:10
  • Multithreading is not one of Python's strong points. The Python GIL makes it impossible for multiple threads to execute the same code in parallel. Read more on the GIL and multithreading to get an idea. Commented Apr 20, 2022 at 8:11
  • Are you sure this is an issue with the code? For example, could you be rate-limited by the site you are downloading from? Commented Apr 20, 2022 at 8:14
  • @Kris I don't think it is correct to say "multithreading is not a Python-good feature". It is a perfectly good and sensible choice when I/O is involved (because the GIL is released during I/O), which is exactly the case here. Commented Apr 20, 2022 at 8:33
  • The network is not multi-threaded. Your expectations are ill-founded. Commented Apr 20, 2022 at 8:39
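The per-download timing the first comment asks for can be sketched offline like this. `timed_task` is a hypothetical stand-in for `download_image` that sleeps instead of calling `requests.get`, so the concurrency pattern is visible without any network:

```python
import time
import concurrent.futures

# Hypothetical stand-in for download_image: a sleep simulates the network
# wait, during which the GIL is released just like during real I/O.
def timed_task(name, duration=0.2):
    start = time.perf_counter()
    time.sleep(duration)
    end = time.perf_counter()
    return name, start, end

t1 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(timed_task, ['a', 'b', 'c', 'd', 'e']))
elapsed = time.perf_counter() - t1

for name, start, end in results:
    print(f'{name}: {start:.3f} -> {end:.3f}')

# With real concurrency, all start times cluster together and the wall-clock
# total is far below the sum of the individual task durations.
print(f'total {elapsed:.3f}s vs serial {sum(e - s for _, s, e in results):.3f}s')
```

If the same pattern is applied to the real downloads and the start times still cluster but each download takes far longer than when run alone, the bottleneck is the network or the server, not the threading.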

2 Answers


I can see some performance improvement when using the multiprocessing package.

    import multiprocessing
    import time
    from multiprocessing import Pool

    import requests

    # img_urls: the same list as in the question

    def download_image(img_url: str) -> None:
        img_bytes = requests.get(img_url).content
        img_name = img_url.split('/')[3]
        img_name = f'{img_name}.jpg'
        with open(img_name, 'wb') as img_file:
            img_file.write(img_bytes)
        print(f'{img_name} was downloaded...')

    if __name__ == '__main__':
        t1 = time.perf_counter()
        with Pool(processes=multiprocessing.cpu_count() - 1 or 1) as pool:
            pool.map(download_image, img_urls)
        t2 = time.perf_counter()
        print(f'Finished in {t2 - t1} seconds')


This is the result I got with your piece of code, with the start and end time printed next to each download. The overall time is about the same (on my "normal" network, not the slow one I mentioned in my comment).

The reason is that multiple threads don't increase I/O or bandwidth here; the limitation could also be the website itself. It looks like the issue is not in your code.

EDIT (misleading statement): as mentioned by MisterMiyagi in the comment below (read his comment, he explains why), threading should increase effective I/O throughput; that's why I saw a 10 s improvement on a slow network (limited connection in my work lab). It just doesn't increase I/O or bandwidth in this specific case (with full bandwidth on my "normal" connection). The cause could be many things, but in my opinion it is not the code itself.

I also tried with max_workers=5; the overall time was the same.

    photo-1516117172878-fd2c41f4a759.jpg was downloaded... 1.0464828 - 1.7136098
    photo-1532009324734-20a7a5813719.jpg was downloaded... 1.7140197 - 5.6327612
    photo-1524429656589-6633a470097c.jpg was downloaded... 5.6339666 - 8.3146478
    photo-1530224264768-7ff8c1789d79.jpg was downloaded... 8.3160157 - 10.474087
    photo-1564135624576-c5c88640f235.jpg was downloaded... 10.4749598 - 11.2431941
    photo-1541698444083-023c97d3f4b6.jpg was downloaded... 11.2436369 - 15.6939695
    photo-1522364723953-452d3431c267.jpg was downloaded... 15.6954112 - 18.3257819
    photo-1513938709626-033611b8cc03.jpg was downloaded... 18.3269668 - 21.0607191
    photo-1507143550189-fed454f93097.jpg was downloaded... 21.0621265 - 22.2371699
    photo-1493976040374-85c8e12f0c0e.jpg was downloaded... 22.2375931 - 26.4375676
    photo-1504198453319-5ce911bafcde.jpg was downloaded... 26.4393404 - 28.3477933
    photo-1530122037265-a5f1f91d3b99.jpg was downloaded... 28.348679 - 30.4626719
    photo-1516972810927-80185027ca84.jpg was downloaded... 30.4636931 - 32.2621345
    photo-1550439062-609e1531270e.jpg was downloaded... 32.2628976 - 34.7331719
    photo-1549692520-acc6669e2f0c.jpg was downloaded... 34.7341393 - 35.5910094
    Finished in 34.545366900000005 seconds
    21
    photo-1516117172878-fd2c41f4a759.jpg was downloaded... 35.5960486 - 46.1692758
    photo-1564135624576-c5c88640f235.jpg was downloaded... 35.6110777 - 47.3780254
    photo-1507143550189-fed454f93097.jpg was downloaded... 35.6265503 - 47.4433963
    photo-1549692520-acc6669e2f0c.jpg was downloaded... 35.6692061 - 49.7097683
    photo-1516972810927-80185027ca84.jpg was downloaded... 35.6420564 - 57.2326763
    photo-1504198453319-5ce911bafcde.jpg was downloaded... 35.6340008 - 61.4597509
    photo-1550439062-609e1531270e.jpg was downloaded... 35.6637577 - 62.0488296
    photo-1530224264768-7ff8c1789d79.jpg was downloaded... 35.6072146 - 63.4139648
    photo-1513938709626-033611b8cc03.jpg was downloaded... 35.6223106 - 63.8149815
    photo-1524429656589-6633a470097c.jpg was downloaded... 35.6032493 - 63.8284464
    photo-1530122037265-a5f1f91d3b99.jpg was downloaded... 35.6352735 - 65.0513042
    photo-1522364723953-452d3431c267.jpg was downloaded... 35.6182243 - 65.5005548
    photo-1532009324734-20a7a5813719.jpg was downloaded... 35.5994888 - 66.2930857
    photo-1541698444083-023c97d3f4b6.jpg was downloaded... 35.6144996 - 67.8115219
    photo-1493976040374-85c8e12f0c0e.jpg was downloaded... 35.6301133 - 68.5357319
    Finished in 32.946069800000004 seconds

EDIT 2 (more testing): I tried with one of my web servers (same code, just a different image list) and got an overall 60-70% decrease in download time. It works best with a limited number of workers in that case. The problem comes from the website, not your code.

4 Comments

"The reason is that multiple threads don't increase I/O or bandwidth" seems like quite a stretch: the major advantage of threading is to improve I/O throughput, because the GIL is released during I/O operations so that multiple requests can run at once. This is especially the case for network I/O, which spends considerable time waiting due to network latency. The images seem hardly large enough to hit the actual network bandwidth limit.
@MisterMiyagi To be honest, my statement is not how it is supposed to work, but how it actually works. My best guess would be a server-side limitation, for whatever reason. I also tried creating a session with requests to see a difference, and there's no bandwidth or I/O increase (in my case; I checked my resource manager over multiple tries). I might edit my answer if you tell me that my statement is misleading.
Ah, understood what you were trying to say. I took it as an explanation, not as an observation. Thanks for clarifying.
I edited the post, thanks for pointing this out (not a native English speaker).
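The per-thread session mentioned above can be sketched with thread-local storage. This is a hedged sketch, not the answerer's exact code: `requests.Session` objects are not documented as thread-safe, so giving each worker thread its own session is the cautious pattern, and reusing a session keeps TCP/TLS connections alive between downloads, which can matter as much as thread count when latency dominates:

```python
import threading
import requests

# One requests.Session per worker thread, created lazily on first use.
_local = threading.local()

def get_session():
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
    return _local.session

# Hypothetical drop-in replacement for the question's download_image.
def download_image(img_url):
    img_bytes = get_session().get(img_url).content
    img_name = img_url.split('/')[3] + '.jpg'
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes)
```

Within one thread, repeated `get_session()` calls return the same session (so connections are reused), while each pool worker gets its own independent session.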
