
I am running a function on multiple processes, which takes a dict of large pandas dataframes as input. When starting the processes, the dict is copied to each process, but as far as I understand, the dict only contains references to the dataframes, so the dataframes themselves are not copied to each process. Is this correct, or does each process get a deep copy of the dict?

import numpy as np
from multiprocessing import Pool, Process, Manager

def process_dataframe(df_dict, task_queue, result_queue):
    while True:
        try:
            test_value = task_queue.get(block=False)
        except:
            break
        else:
            result = {df_name: df[df == test_value].sum()
                      for df_name, df in df_dict.items()}
            result_queue.put(result)

if __name__ == '__main__':
    manager = Manager()
    task_queue = manager.Queue()
    result_queue = manager.Queue()

    df_dict = {'df_1': some_large_df1, 'df_2': some_large_df2, ...}

    test_values = np.random.rand(1000)
    for val in test_values:
        task_queue.put(val)

    with Pool(processes=4) as pool:
        processes = []
        for _ in range(4):
            # Is df_dict copied shallow or deep to each process?
            p = pool.Process(target=process_dataframe,
                             args=(df_dict, task_queue, result_queue))
            processes.append(p)
            p.start()
        for p in processes:
            p.join()

    results = [result_queue.get(block=False) for _ in range(result_queue.qsize())]
3 Comments
  • Where do you copy it? df_dict is never copied. I see it passed as an argument to pool.Process... Since it is passed as an argument, the entire dict is passed as a pointer. So not even a shallow copy occurs. Commented Apr 9, 2019 at 1:52
  • You should also make your dict thread safe for this reason. Commented Apr 9, 2019 at 1:55
  • @Error-SyntacticalRemorse I read somewhere that the arguments are pickled before being sent to each Process. Is it only the pointer that gets pickled then? Commented Apr 9, 2019 at 7:06

1 Answer


TLDR: Interestingly enough, it does pass a copy, but not in the normal way. On systems that start child processes with fork (Linux, and macOS before Python 3.8), the child and parent processes share the same memory thanks to copy-on-write (COW), unless one or the other changes an object. In that case memory is allocated only for the object that was changed. (Windows has no fork; it uses the spawn start method, which pickles the arguments, so the child there really does get its own copy.)

I am a firm believer that it is better to see something in action rather than just be told. That said, let's get into it.

I pulled some example multiprocessing code from online for this. The sample code fits the bill for answering this question but it does not match the code from your question.

All of the following code is one script but I am going to break it apart to explain each part.


Start the Example:

First, let's create a dictionary. I will use this instead of a DataFrame since they behave similarly for our purposes, and I do not need to install a package to use it.

Note the id() function: it returns the unique identity of an object.

# importing the multiprocessing module
import multiprocessing
import sys  # So we can see the memory we are using

myDict = dict()
print("My dict ID is:", id(myDict))

myList = [0 for _ in range(10000)]  # Just a massive list of all 0s
print('My list size in bytes:', sys.getsizeof(myList))

myDict[0] = myList
print("My dict size with the list:", sys.getsizeof(myDict))
print("My dict ID is still:", id(myDict))
print("But if I copied my dic it would be:", id(myDict.copy()))

For me this outputted:

My dict ID is: 139687265270016
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 139687265270016
But if I copied my dic it would be: 139687339197496

Cool so we see the id will change if we copy the object and we see that the dictionary is just holding a pointer to the list (thus the dict is significantly smaller in memory size).
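The same pointer-sharing shows up in a shallow copy: dict.copy() gives a new dict object, but the values inside it are still the very same objects. A quick sketch (plain Python, no DataFrames needed) to make that concrete:

```python
import copy

original = {0: [0] * 5}

shallow = original.copy()       # new dict object, but the inner list is shared
deep = copy.deepcopy(original)  # new dict AND a new inner list

print(shallow[0] is original[0])  # True: same list object
print(deep[0] is original[0])     # False: deepcopy duplicated it

# Mutating through the shallow copy is visible in the original...
shallow[0][0] = 1
print(original[0])  # [1, 0, 0, 0, 0]

# ...but mutating through the deep copy is not.
deep[0][0] = 99
print(original[0])  # [1, 0, 0, 0, 0]
```

So a "copy of the dict" on its own says nothing about whether the big payloads inside it were duplicated; that is exactly the distinction the question is asking about.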

Now let's look into if a Process copies a dictionary.

def method1(var):
    print("method1 dict id is:", str(id(var)))

def method2(var):
    print("method2 dict id is:", str(id(var)))

if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=method2, args=(myDict, ))
    p2 = multiprocessing.Process(target=method1, args=(myDict, ))

    # starting process 1
    p1.start()
    # starting process 2
    p2.start()

    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()

    # both processes finished
    print("Done!")

Here I am passing myDict as an arg to both my subprocess functions. This is what I get as an output:

method2 dict id is: 139687265270016
method1 dict id is: 139687265270016
Done!

Note: The id is the same as when we defined the dictionary earlier on in the code.

What does this mean?

If the id never changes, then we are using the same object in all instances. So in theory, if I make a change in a Process, it should change the main object. But that doesn't happen like we would expect.

For example, let's change our method1:

def method1(var):
    print("method1 dict id is:", str(id(var)))
    var[0][0] = 1
    print("The first five elements of the list in the dict are:", var[0][:5])

AND add a couple of prints after our p2.join():

p2.join()

print("The first five elements of the list in the dict are:", myDict[0][:5])
print("The first five elements of the list are:", myList[:5])

My dict ID is: 140077406931128
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 140077406931128
But if I copied my dic it would be: 140077455160376
method1 dict id is: 140077406931128
The first five elements of the list in the dict are: [1, 0, 0, 0, 0]
method2 dict id is: 140077406931128
The first five elements of the list in the dict are: [0, 0, 0, 0, 0]
The first five elements of the list are: [0, 0, 0, 0, 0]
Done!

Well, that's interesting... The ids are the same, and I can change the object in the function, but the dict doesn't change in the main process...

Well after some further investigation I found this question/answer: https://stackoverflow.com/a/14750086/8150685

When creating a child process with fork, the child inherits a copy of the parent process's address space (including the same id()s!); however, provided the OS you are using implements COW (copy-on-write), the child and parent will use the same physical memory unless either of them makes a change to the data, in which case memory is allocated only for the object that was changed (in your case, it would make a copy of the DataFrame you changed).

Sorry for the long post, but I figured the workflow would be good to look at.

Hopefully this helped. Don't forget to upvote the question and answer at https://stackoverflow.com/a/14750086/8150685 if it helped you.


3 Comments

@connesy don't forget to mark it as an answer if it solved your question.
Thank you very much for the thorough response, that was enlightening! I have a couple of follow-up questions: 1: If I understand your example correctly, the id of the dict is copied to the child process, meaning the id of the dict in the child process will be the same as the id in the parent process, even if I make a change to the dict in one of the processes? 2: According to [this comment](bit.ly/2U8U5IX) on the q/a you linked, the DataFrames might be copied by reading them?
Only if you modify the data. Reading doesn't count as modifying. In other words, if all you do is read the dataframe, then no additional memory will be used; however, if you make a change, the whole dataframe will be copied in the background. The id of the dataframe will still not change, though. (id() is local to each process, so from my testing it did not change even when I changed the object in a separate process.)
