How to properly reference the previous Pandas DataFrame in the next method in a method chain?

Question

I have been trying to use method chaining in Pandas; however, a few things about how you reference a DataFrame or its columns keep tripping me up.

For example, in the code below I have filtered the dataset and then want to create a new column that sums the columns remaining after the filter. However, I don't know how to reference the DataFrame that has just been created by the filter. df in the example below refers to the original DataFrame.

    import pandas as pd

    df = pd.DataFrame(
        {
            'xx':[1,2,3,4,5,6],
            'xy':[1,2,3,4,5,6],
            'z':[1,2,3,4,5,6],
        }
    )

    df = (
        df
        .filter(like='x')
        .assign(n = df.sum(axis=1))
    )

    df.head(6)

Or what about this instance, where the DataFrame is created within the method chain? This would normally be a pd.read_csv step rather than a generated DataFrame. This code naturally does not work, as df2 has not been created yet.

    df2 = (
        pd.DataFrame(
            {
                'xx':[1,2,3,4,5,6],
                'xy':[1,2,3,4,5,6],
                'z':[1,2,3,4,5,6],
            }
        )
        .assign(
            xx = df2['xx'].mask(df2['xx']>2,0)
        )
    )

    df2.head(6)

Interestingly enough, the issue above is not a problem here, as df3['xx'] appears to refer to the df3 that has been queried. That makes some sense in the context of the second example, but it does not make sense alongside the first example.

    df3 = pd.DataFrame(
        {
            'xx':[1,2,3,4,5,6],
            'xy':[1,2,3,4,5,6],
            'z':[1,2,3,4,5,6],
        }
    )

    df3 = (
        df3
        .query('xx > 3')
        .assign(
            xx = df3['xx'].mask(df3['xx']>4,0)
        )
    )

    df3.head(6)

I have worked in other languages/libraries such as R or PySpark, where method chaining is quite flexible and does not appear to have these barriers. Perhaps there is something I am missing about how it is meant to be done in Pandas, or about how you are meant to reference df['xx'] in some other manner.

Lastly, I understand that the example problems are easily worked around, but I am trying to understand whether there is a set method-chaining syntax that I am not aware of for referencing these columns.

use a lambda (anonymous function): df.assign(xx = lambda df: ....)
– sammywemmy, Commented Oct 13, 2021 at 7:07

sammywemmy · Accepted Answer · 2021-10-13 07:15:57Z

To reference the DataFrame produced by a previous computation, an anonymous function (lambda) helps:

    df.filter(like='x').assign(n = lambda df: df.sum(1))

       xx  xy   n
    0   1   1   2
    1   2   2   4
    2   3   3   6
    3   4   4   8
    4   5   5  10
    5   6   6  12

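When assign receives a callable, it calls it with the DataFrame that assign was invoked on, so the lambda's argument is the result of the previous step in the chain rather than the original df.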
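As a rough sketch, here is how the lambda form could be applied to the three snippets from the question. The names wrong, right and the lambda argument d are only illustrative, and the data is the same toy frame used in the question.

```python
import pandas as pd

data = {
    'xx': [1, 2, 3, 4, 5, 6],
    'xy': [1, 2, 3, 4, 5, 6],
    'z': [1, 2, 3, 4, 5, 6],
}
df = pd.DataFrame(data)

# Example 1: df.sum(axis=1) is evaluated on the ORIGINAL frame (xx + xy + z),
# while the lambda receives the filtered frame (xx + xy only).
wrong = df.filter(like='x').assign(n=df.sum(axis=1))
right = df.filter(like='x').assign(n=lambda d: d.sum(axis=1))

# Example 2: the frame is created inside the chain, so there is no name to
# refer to yet; the lambda argument d stands in for it.
df2 = (
    pd.DataFrame(data)
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 2, 0))
)

# Example 3: d is the frame returned by query, not the unfiltered one.
df3 = (
    pd.DataFrame(data)
    .query('xx > 3')
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 4, 0))
)
```

The third snippet in the question only appeared to work because assign aligns a Series on the index, so values computed from the unfiltered df3 were matched to the surviving rows by label. With the lambda, nothing depends on that alignment.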

The pipe method is another option where you can chain methods while referencing the computed DataFrame.

The example below is superfluous; hopefully it explains how pipe works:

    df3.pipe(lambda df: df.assign(r = 2))
    Out[37]:
       xx  xy  z  r
    0   1   1  1  2
    1   2   2  2  2
    2   3   3  3  2
    3   4   4  4  2
    4   5   5  5  2
    5   6   6  6  2

Not all Pandas functions support chaining; this is where pipe comes in handy. You can even write custom functions and pass them to pipe.
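For instance, a custom function passed to pipe might look like the sketch below. add_row_total and its name parameter are hypothetical, made up for illustration; pipe simply passes the DataFrame from the previous step as the first argument and forwards any extra keyword arguments.

```python
import pandas as pd


def add_row_total(frame, name='n'):
    """Hypothetical helper: return a copy of frame with a row-sum column."""
    return frame.assign(**{name: frame.sum(axis=1)})


df = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)

result = (
    df
    .filter(like='x')                    # keep only xx and xy
    .pipe(add_row_total, name='total')   # frame = the filtered DataFrame
)
```

Inside add_row_total, frame is already the filtered DataFrame, so no lambda is needed there.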

All of this information is in the docs: assign; pipe; function application; assignment in method chaining

I have worked with the Pipe method in the past, but I think you nailed the solution here with the use of the lambda function. I see this is explained well in your link 'assignment in method chaining'. Thank you very much for the help on this. It was doing my head in for a while.
