
I'm a noob in this context:

I am trying to run one function in multiple processes so I can process a huge file in a shorter time.

I tried

    from multiprocessing import Process

    for file_chunk in file_chunks:
        p = Process(target=my_func, args=(file_chunk, my_arg2))
        p.start()  # without .join(), otherwise the main process has to wait
                   # for proc1 to finish before it can start proc2

but it didn't seem fast enough.

Now I wonder whether it is really running the jobs in parallel. I also thought about Pool, but I am using Python 2 and it is ugly to make Pool.map pass two arguments to the function.

Am I missing something in my code above, or do processes created this way really run in parallel?

  • How many chunks do you expect to be processed? Are you spinning up hundreds of new processes here? A pool lets you create a set number of workers and divide the chunks (tasks) among them without overloading your system. Commented May 23, 2017 at 15:50
  • @svohara I have only 20 chunks, so I expected processing to be 20 times faster, which is not happening with this code. Commented May 23, 2017 at 15:52

2 Answers


The speedup is proportional to the amount of CPU cores your PC has, not the amount of chunks.

Ideally, if you have 4 CPU cores, you should see up to a 4x speedup. In practice, other factors such as inter-process communication (IPC) overhead reduce that gain.

Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.

I'd recommend using a multiprocessing.Pool to handle most of the logic. If you have multiple arguments, just use the apply_async method.

    from multiprocessing import Pool

    pool = Pool()
    for file_chunk in file_chunks:
        pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all workers to finish



I am not an expert either, but what you should try is joblib's Parallel:

    from joblib import Parallel, delayed
    import multiprocessing as mp

    def random_function(args):
        pass

    proc = mp.cpu_count()
    Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)

This runs the given function (random_function) across n_jobs worker processes, here set to the number of available CPUs.

Feel free to read the docs!
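For the two-argument case from the question, a sketch of how it might look with joblib (process_chunk, the chunks, and the value 10 are hypothetical stand-ins):

```python
from joblib import Parallel, delayed
import multiprocessing as mp

def process_chunk(file_chunk, my_arg2):
    # placeholder for the real per-chunk work
    return len(file_chunk) + my_arg2

if __name__ == "__main__":
    file_chunks = ["abc", "de", "f"]  # stand-in chunks
    n_jobs = mp.cpu_count()
    # delayed(...) captures each call; Parallel fans them out to workers
    results = Parallel(n_jobs=n_jobs)(
        delayed(process_chunk)(chunk, 10) for chunk in file_chunks
    )
    print(results)  # [13, 12, 11]
```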

