34

I have run into a problem I have never seen before: a DataFrame passed to a function is modified inside it, and the change shows up in the original. Is there a way to keep the original DataFrame from being modified?

    import numpy as np
    import pandas as pd

    def test(df):
        # Adds a column to the DataFrame object that was passed in
        df['tt'] = np.nan
        return df

    dff = pd.DataFrame(data=[])

Now, when I print dff, the output is

    Empty DataFrame
    Columns: []
    Index: []

If I pass dff to test() defined above, dff is modified. In other words,

    df = test(dff)
    print(dff)

now prints

    Empty DataFrame
    Columns: [tt]
    Index: []

How do I make sure dff is not modified after being passed to test()?

  • Pass a copy of the dataframe? Or make one inside the function, and mutate and return that? It's bad form to mutate an argument and return anything other than None. Commented Jul 24, 2015 at 15:09
  • That's a solution, but it isn't memory efficient. It's the first time I've run into this, though. Is it due to version 0.16.2? Commented Jul 24, 2015 at 15:10
  • You can call .copy() to take an explicit deep copy. Commented Jul 24, 2015 at 15:10
  • Nope, nothing to do with changing versions. This behaviour is the same for all mutable objects passed to Python functions; it is unique neither to Pandas generally nor to v0.16.2 specifically. Commented Jul 24, 2015 at 15:11
  • Can you tell us a bit more about your use case? If you want to return the df at the end of the function, I don't think you can avoid doing a .copy(). Commented Jul 24, 2015 at 22:27

2 Answers 2

65
    def test(df):
        df = df.copy(deep=True)
        df['tt'] = np.nan
        return df

If you pass a dataframe into a function, modify it, and return it, you get back the same dataframe object, now modified. If you want to keep your old dataframe and also get a new dataframe with your modifications, then by definition you need two dataframes: the one you pass in, which stays unmodified, and the new, modified one. So if you don't want to change the original dataframe, your best bet is to make a copy of it.

In my example I rebind the name "df" inside the function to the copied dataframe. The copy method with the argument "deep=True" makes a copy of the dataframe including its data and indices. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
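To make the difference between mutating and rebinding concrete, here is a minimal sketch. The function names `mutate_in_place` and `mutate_copy` and the sample column `a` are made up for illustration; only the `df['tt'] = np.nan` assignment and the `copy(deep=True)` call come from the question and this answer.

```python
import numpy as np
import pandas as pd

def mutate_in_place(df):
    # Column assignment changes the object the caller passed in.
    df['tt'] = np.nan
    return df

def mutate_copy(df):
    # Rebinding the local name to a copy leaves the caller's object alone.
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df

first = pd.DataFrame({'a': [1, 2, 3]})
returned = mutate_in_place(first)
in_place_same_object = returned is first       # the very same object comes back
in_place_columns = list(first.columns)         # the original gained 'tt'

second = pd.DataFrame({'a': [1, 2, 3]})
copied = mutate_copy(second)
copy_same_object = copied is second            # a different object comes back
copy_original_columns = list(second.columns)   # the original is untouched
copy_result_columns = list(copied.columns)

print(in_place_same_object, in_place_columns)
print(copy_same_object, copy_original_columns, copy_result_columns)
```

The price of the second version is the extra memory for the copy, which is the trade-off raised in the comments on the question.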


2 Comments

Is this also true for pyspark dataframes?
Thanks! I've been using pandas for a while and only just came across this myself. I'm training a model on a dataframe, and the training function makes some changes to df but does not return it. This still modifies the original dataframe. Is copying the only way?
4

As Skorpeo mentioned, since a dataframe can be modified in-place, it can be modified inside a function. One way to not modify the original is to make a new copy inside the function as in Skorpeo's answer.

If you don't want to change the function, passing a copy is also an option:

    def test(df):
        df['tt'] = np.nan
        return df

    df = test(dff.copy())   # <---- pass a copy of `dff`

1 Comment

I was wondering whether deep=True was a necessary argument for the copy, then I found out that deep=True is the default.
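One way to confirm this for yourself is sketched below: it reads the default of the `deep` parameter from the method signature and checks that a plain `copy()` is independent of the original. The sample data and variable names are made up for illustration. The behaviour of `deep=False` is deliberately left out, since it can differ between pandas versions.

```python
import inspect

import pandas as pd

# Read the default value of `deep` straight from the method signature.
default_deep = inspect.signature(pd.DataFrame.copy).parameters['deep'].default

original = pd.DataFrame({'a': [1, 2, 3]})
implicit = original.copy()            # relies on the default
explicit = original.copy(deep=True)   # spells the default out

# Changing either copy should not touch the original.
implicit.loc[0, 'a'] = 100
explicit.loc[0, 'a'] = 200

print(default_deep)
print(original['a'].tolist())
print(implicit['a'].tolist(), explicit['a'].tolist())
```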
