I have been trying to use method chaining in Pandas, but a few things about how you reference a DataFrame or its columns keep tripping me up.
For example, in the code below I filter the dataset and then want to create a new column that sums the columns remaining after the filter. However, I don't know how to reference the DataFrame that has just been created by the filter; df in the example below refers to the original DataFrame.

    import pandas as pd

    df = pd.DataFrame(
        {
            'xx': [1, 2, 3, 4, 5, 6],
            'xy': [1, 2, 3, 4, 5, 6],
            'z': [1, 2, 3, 4, 5, 6],
        }
    )
    df = (
        df
        .filter(like='x')
        .assign(n = df.sum(axis=1))  # df here is still the original, unfiltered frame
    )
    df.head(6)

Or what about this instance, where the DataFrame is created inside the method chain? This would normally be a pd.read_csv step rather than generating the DataFrame by hand. This piece of code naturally does not work, as df2 has not been created yet.

    df2 = (
        pd.DataFrame(
            {
                'xx': [1, 2, 3, 4, 5, 6],
                'xy': [1, 2, 3, 4, 5, 6],
                'z': [1, 2, 3, 4, 5, 6],
            }
        )
        .assign(
            xx = df2['xx'].mask(df2['xx'] > 2, 0)  # fails: df2 is not defined yet
        )
    )
    df2.head(6)

Interestingly enough, the issue above is not a problem here, as df3['xx'] appears to refer to the df3 that has been queried. That makes some sense in the context of the second example, but it does not make sense alongside the first example.

    df3 = pd.DataFrame(
        {
            'xx': [1, 2, 3, 4, 5, 6],
            'xy': [1, 2, 3, 4, 5, 6],
            'z': [1, 2, 3, 4, 5, 6],
        }
    )
    df3 = (
        df3
        .query('xx > 3')
        .assign(
            xx = df3['xx'].mask(df3['xx'] > 4, 0)
        )
    )
    df3.head(6)

I have worked with other languages and libraries such as R and PySpark, where method chaining is quite flexible and does not appear to have these barriers. Perhaps I am missing something about how it is meant to be done in Pandas, or about how you are meant to reference df['xx'] in some other manner.
Lastly, I understand that the example problems are easily worked around, but I am trying to understand whether there is a set method-chaining syntax that I am not aware of for referencing these columns.
df.assign(xx = lambda df: ....)
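
To spell out the lambda hint above, here is a minimal sketch of the three examples rewritten so that assign receives a callable. As far as I understand it, assign calls the function with the DataFrame as it exists at that point in the chain, so the intermediate result never needs a name. The helper make_frame and the names d, out1, out2 and out3 are only for illustration and are not part of the original code.

```python
import pandas as pd


def make_frame() -> pd.DataFrame:
    """Hypothetical helper that rebuilds the sample data used in the question."""
    return pd.DataFrame(
        {
            "xx": [1, 2, 3, 4, 5, 6],
            "xy": [1, 2, 3, 4, 5, 6],
            "z": [1, 2, 3, 4, 5, 6],
        }
    )


# Example 1: d is the filtered frame (columns xx and xy only), not the original.
out1 = (
    make_frame()
    .filter(like="x")
    .assign(n=lambda d: d.sum(axis=1))
)

# Example 2: the frame is created inside the chain, so there is no outer
# variable to refer to; the lambda receives the frame directly.
out2 = (
    make_frame()
    .assign(xx=lambda d: d["xx"].mask(d["xx"] > 2, 0))
)

# Example 3: d is the queried frame (rows where xx > 3).
out3 = (
    make_frame()
    .query("xx > 3")
    .assign(xx=lambda d: d["xx"].mask(d["xx"] > 4, 0))
)
```

One note on the third example in the question: I believe df3['xx'] there still refers to the original, unqueried df3. It only looks correct because assign aligns the Series to the queried frame by index, so the extra rows are dropped. With a condition that depends on the filtered rows, the two versions could give different results, which is why the callable form seems the safer habit.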