There are three major methods available to you: multiprocessing, multithreading, and coroutines. There are multiple ways to do each, and in the case of coroutines, multiple names for the techniques.
With multiprocessing you create a new process for each download; with multithreading you create a new thread. Threads are lighter weight than processes (though not by much) and share memory, so it costs the OS less to switch between threads, and each thread can access the same data without copying it or passing messages. The downside of threads is that you must control access to shared memory to prevent race conditions, and in Python threads will never actually run in parallel, because they share one interpreter, which has a Global Interpreter Lock (GIL) to prevent such conditions from occurring in the interpreter itself.
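A minimal sketch of the thread-per-download approach, with a lock guarding the shared dict. The URLs are made up, and time.sleep stands in for a real network fetch (e.g. urllib.request.urlopen) so the example is self-contained:

```python
import threading
import time

results = {}
lock = threading.Lock()  # guard the shared dict against race conditions

def download(url):
    time.sleep(0.1)  # stand-in for waiting on a real network fetch
    data = f"contents of {url}"
    with lock:       # only one thread mutates results at a time
        results[url] = data

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
threads = [threading.Thread(target=download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()         # wait for every download to finish

print(len(results))  # 3
```

While each `download` is blocked in its IO wait it releases the GIL, which is why threads still help here even though they never run Python bytecode in parallel.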
With multiprocessing this isn't an issue, as every process has its own interpreter and its own GIL. Each process also has its own memory space, so you would have to set up any shared memory explicitly (though that doesn't really apply to your problem). Because each process has its own memory and interpreter, switching between processes takes longer than switching between threads.
Coroutines are sort of like even lighter-weight threads that live within a single thread. A coroutine does as much work as it can until it would be waiting on input/output (IO), at which point its state is saved (automatically) and control moves on to other work. Every time a coroutine runs out of useful work or starts waiting on IO, your program goes back and checks which pending coroutines can now make progress based on IO operations that have completed (in theory the check could occur at other times as well; that is specific to the coroutine implementation). This way you pay the least cost for switching contexts (i.e. between threads or processes) and use the least memory. Your program in this case never actually does anything in parallel, but the operating system performs parallel IO operations on your behalf.
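A sketch of that hand-off with asyncio. Each `await` is the point where a coroutine saves its state and yields to the event loop; asyncio.sleep stands in for a real async network read (real HTTP would need a library such as aiohttp), and the URLs are made up:

```python
import asyncio

async def download(url):
    # await suspends this coroutine while "IO" is pending, letting the
    # event loop run the other coroutines in the meantime
    await asyncio.sleep(0.1)  # stand-in for an async network read
    return url, f"contents of {url}"

async def main():
    urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
    # gather schedules all three coroutines concurrently on one thread;
    # total wall time is ~0.1s, not 0.3s, because the waits overlap
    pairs = await asyncio.gather(*(download(u) for u in urls))
    return dict(pairs)

results = asyncio.run(main())
print(len(results))  # 3
```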
The multiprocessing library will give you multiprocessing, threading will give you multithreading (I suggest using a Pool or ThreadPool if you choose one of these), and asyncio will give you coroutine support. See here for a tutorial which matches very closely what you want.
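For completeness, the ThreadPool mentioned above lives in multiprocessing.pool and has the same API as Pool but uses threads, so there is no pickling and no process startup cost. Again the fetch is a stand-in:

```python
from multiprocessing.pool import ThreadPool

def download(url):
    return url, f"contents of {url}"  # stand-in for a real fetch

urls = ["http://example.com/%d" % i for i in range(5)]
with ThreadPool(processes=3) as pool:  # 3 worker threads reused across jobs
    # map blocks until all five downloads complete, three at a time
    results = dict(pool.map(download, urls))
print(len(results))  # 5
```

A pool also caps how many downloads run at once, which is kinder to the remote server than spawning one thread per file.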
Of course, even if you make these operations parallel they may not be faster; it depends on where the files are being downloaded from and how much bandwidth the server will allocate to your downloads. That is, if you get 100 kbps for one download, will you just get 50 kbps each for two? Even then it would most likely still be a bit faster, since you'd be doing the TCP handshakes in parallel.