244

I have a pandas DataFrame, df_test. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:

    df_test = pd.DataFrame([
        {'dir': '/Users/uname1', 'size': 994933},
        {'dir': '/Users/uname2', 'size': 109338711},
    ])

    df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
    df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
    df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

    df_test

                 dir       size       size_kb   size_mb size_gb
    0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
    1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

    [2 rows x 5 columns]

I've run this over 120,000 rows, and according to %timeit it takes about 2.97 seconds per column, so roughly 9 seconds for all three.

Is there any way I can make this faster? For example, instead of returning one column at a time from apply and running it three times, can I return all three columns in one pass and insert them back into the original dataframe?

The other questions I've found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.

2

13 Answers

257

You can return a Series that contains the new data from the applied function, which avoids iterating three times. Passing axis=1 to apply calls the function sizes once per row of the dataframe, and the Series returned for each row becomes a row of the new dataframe. This Series, s, contains the new values as well as the original data.

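    import locale

    # locale.format was removed in Python 3.12; locale.format_string is the replacement.
    def sizes(s):
        s['size_kb'] = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
        s['size_mb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        s['size_gb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return s

    # The original answer also ran `df_test = df_test.append(rows_list)` here.
    # `rows_list` was never defined and DataFrame.append was removed in pandas 2.0,
    # so that line is left out; df_test is the dataframe from the question.
    df_test = df_test.apply(sizes, axis=1)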

7 Comments

I'm surprised it spent nearly 2 years without the right answer. I was looking for something else and stumbled on this. Hope it isn't too late to be useful!
What is rows_list in this answer?
If the pd.Series needs an index, you need to supply it with pd.Series(data, index=...). Otherwise you get cryptic errors when you try to assign the result back into the parent dataframe.
@Nelz11 I suggest using the example from the question instead of the rows_list formulation, so that your answer runs without any problems (see also @David Stansby's comment). I proposed this as an edit to save you the hassle, but evidently moderators prefer comments over edits.
Note that this may modify the original DataFrame, so you may want to use df_test = df_test.copy().apply(sizes, axis=1) if that's not desired (which of course may have performance and memory implications).
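The two comments above (supplying an index to pd.Series, and not modifying the original DataFrame) can be combined in one small sketch. This is only an illustration, not the answer author's code: the helper name sizes_as_series is made up, and plain f-string formatting stands in for locale.format_string so the example does not depend on locale settings.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes_as_series(row):
    # Hypothetical helper: build a new, labelled Series instead of
    # writing into `row`, so the input row is never mutated.
    size = row['size']
    return pd.Series(
        [
            f"{size / 1024:,.1f} KB",
            f"{size / 1024 ** 2:,.1f} MB",
            f"{size / 1024 ** 3:,.1f} GB",
        ],
        index=['size_kb', 'size_mb', 'size_gb'],
    )

# The index labels of the returned Series become the new column names.
new_cols = df_test.apply(sizes_as_series, axis=1)

# join() aligns on the row index and returns a new frame; df_test is untouched.
result = df_test.join(new_cols)
print(result)
```

Because the function returns a fresh Series, no .copy() of the whole frame is needed, at the cost of constructing one Series per row.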
207

Using apply with zip is about 3 times faster than returning a Series.

    def sizes(s):
        return locale.format_string("%.1f", s / 1024.0, grouping=True) + ' KB', \
            locale.format_string("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
            locale.format_string("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'

    df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

The test results are:

    Separate df.apply():
    100 loops, best of 3: 1.43 ms per loop

    Return Series:
    100 loops, best of 3: 2.61 ms per loop

    Return tuple:
    1000 loops, best of 3: 819 µs per loop

[Figure: performance bar chart]

My example code was developed using Jupyter and is available at this github

8 Comments

@Jesse This no longer works in pandas 1.1.* I get the shape error if I do an apply on the entire frame instead of specific columns
Be aware that the zip approach does not retain the original index; result_type="expand" does.
This is a nice answer, as it avoids the typical ValueError: Columns must be same length as key
This works for me in pandas 1.1.5.
This one provided an insane speedup from the method I was using, which was to parse dicts into DFs then concatenate them all.
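On the index concern raised in the comments: zip(*...) yields plain tuples, so the labels are gone if the result is used to build a new object rather than assigned back to the same frame. One possible way to keep them, sketched here with a made-up non-default index and f-string formatting in place of locale.format_string, is to build a DataFrame from the tuples and pass the index explicitly.

```python
import pandas as pd

# Non-default index labels, to make index handling visible.
df = pd.DataFrame({'size': [994933, 109338711]}, index=['a', 'b'])

def sizes(value):
    # Returns a plain tuple, as in the answer above (two units for brevity).
    return f"{value / 1024:,.1f} KB", f"{value / 1024 ** 2:,.1f} MB"

# Series.apply gives a Series of tuples that still carries df's index.
tuples = df['size'].apply(sizes)

# Expand the tuples into columns and reattach the index explicitly.
expanded = pd.DataFrame(
    tuples.tolist(),
    index=tuples.index,
    columns=['size_kb', 'size_mb'],
)

result = df.join(expanded)
print(result)
```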
167

Some of the current replies work fine, but I want to offer another, maybe more "pandifyed" option. This works for me with the current pandas 0.23 (not sure if it will work in previous versions):

    import locale

    import pandas as pd

    df_test = pd.DataFrame([
        {'dir': '/Users/uname1', 'size': 994933},
        {'dir': '/Users/uname2', 'size': 109338711},
    ])

    def sizes(s):
        a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c

    df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Notice that the trick is the result_type parameter of apply: with "expand", the list-like result of each call is expanded into the columns of a DataFrame, which can be assigned directly to new or existing columns.
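To make the effect of result_type concrete, here is a small sketch on toy data (the frame and the function add_sub_mul are invented for illustration). With the default setting a tuple-returning function gives a Series of tuples, while "expand" turns each tuple into a row of a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

def add_sub_mul(row):
    # Returns three values per row as a tuple.
    return row['a'] + row['b'], row['a'] - row['b'], row['a'] * row['b']

# Default result_type: list-like results are kept as one object per row.
default = df.apply(add_sub_mul, axis=1)

# result_type='expand': each tuple element becomes its own column.
expanded = df.apply(add_sub_mul, axis=1, result_type='expand')

print(type(default).__name__, type(expanded).__name__)
print(expanded)
```

The expanded columns are labelled 0, 1, 2, which is why the answer assigns them to a list of column names.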

3 Comments

The downside is this works with .apply() on DataFrames, not on Series. Also, with pandas 1.1.5 this doesn't work at all.
@MERose +1 for mentioning it won’t work on Series (for which apply() doesn’t have an axis parameter), but it works for me with 1.1.5.
For me, it is the most elegant and inbuilt supported solution. Works fine on pandas 1.3.0
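Since result_type is only available on DataFrame.apply, one possible workaround when the input is a single column is to select it with double brackets, which gives a one-column DataFrame. The sketch below assumes that approach; sizes_from_row is a made-up name, and f-strings replace locale.format_string to keep it locale-independent.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes_from_row(row):
    # `row` is a one-element Series because only the 'size' column was selected.
    value = row['size']
    return (
        f"{value / 1024:,.1f} KB",
        f"{value / 1024 ** 2:,.1f} MB",
        f"{value / 1024 ** 3:,.1f} GB",
    )

# df_test[['size']] is a DataFrame, so axis and result_type are accepted.
df_test[['size_kb', 'size_mb', 'size_gb']] = df_test[['size']].apply(
    sizes_from_row, axis=1, result_type='expand'
)
print(df_test)
```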
44

Really cool answers! Thanks Jesse and jaumebonet! Just some observation in regards to:

  • zip(* ...
  • ... result_type="expand")

Although expand is arguably more elegant (pandifyed), zip is at least 2x faster. On the simple example below, it was 4x faster for me.

    import pandas as pd

    dat = [[i, 10 * i] for i in range(1000)]

    df = pd.DataFrame(dat, columns=["a", "b"])

    def add_and_sub(row):
        add = row["a"] + row["b"]
        sub = row["a"] - row["b"]
        return add, sub

    df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
    # versus
    df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))

2 Comments

"expand" doesn't work with pandas 1.3 but zip works magic! thanks
the second solution with zip worked very well.
31

Just another readable way. This code adds three new columns and their values by returning a Series from the function, without passing any extra parameters to apply.

    # locale.format was removed in Python 3.12; locale.format_string is the replacement.
    def sizes(s):
        val_kb = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
        val_mb = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        val_gb = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])

    df[['size_kb', 'size_mb', 'size_gb']] = df.apply(lambda x: sizes(x), axis=1)

A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

    df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
    #    foo  bar
    # 0    1    2
    # 1    1    2
    # 2    1    2

1 Comment

With just one parameter, the lambda wrapper isn't necessary: df.apply(sizes, axis=1). Also, it's basically the same solution as that of Jesse.
25

Performance varies significantly between the top answers. Jesse and famaral42 have already discussed this, but it is worth sharing a fair comparison and elaborating on a subtle but important detail of Jesse's answer: the argument passed in to the function also affects performance.

(Python 3.7.4, Pandas 1.0.3)

    import pandas as pd
    import locale
    import timeit

    def create_new_df_test():
        df_test = pd.DataFrame([
            {'dir': '/Users/uname1', 'size': 994933},
            {'dir': '/Users/uname2', 'size': 109338711},
        ])
        return df_test

    def sizes_pass_series_return_series(series):
        series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
        series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return series

    def sizes_pass_series_return_tuple(series):
        a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c

    def sizes_pass_value_return_tuple(value):
        a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c

Here are the results:

    # 1 - Accepted (Nels11 Answer) - (pass series, return series):
    9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
    2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # 3 - Tuples (pass series, return tuple then zip):
    1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
    752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice that returning tuples is the fastest method, but what is passed in as an argument also affects performance. The difference in the code is subtle, but the performance improvement is significant.

Test #4 (passing in a single value) is nearly twice as fast as test #3 (passing in a series): 752 µs versus 1.36 ms, even though the operation performed is ostensibly identical.
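The difference between #3 and #4 comes down to what the function receives. The sketch below only inspects the argument types; the explanation that building a Series for every row is what costs the extra time is an assumption on my part, not something measured in this answer.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

# DataFrame.apply with axis=1: the function is handed one Series per row.
row_arg_types = df_test.apply(lambda row: type(row).__name__, axis=1)

# Series.apply on a single column: the function is handed one scalar per element.
value_arg_is_series = df_test['size'].apply(lambda value: isinstance(value, pd.Series))

print(row_arg_types.tolist())
print(value_arg_is_series.tolist())
```

In the row-wise case, every lookup such as series['size'] is also a label-based access on that Series, which the value-based version avoids entirely.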

But there's more...

    # 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
    3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
    2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # 3a - Tuples (pass series, return tuple then zip, new columns exist):
    1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
    694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code for running the tests:

    # Paste and run the following in ipython console. It will not work if you run it from a .py file.
    print('\nAccepted Answer (pass series, return series, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
    print('Accepted Answer (pass series, return series, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

    print('\nPandafied (pass series, return tuple, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
    print('Pandafied (pass series, return tuple, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

    print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
    print('Tuples (pass series, return tuple then zip, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

    print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
    print('Tuples (pass value, return tuple then zip, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
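Because %timeit only works in IPython, here is a rough sketch of a similar comparison that runs from a plain .py file using the standard timeit module. It is not the benchmark used for the numbers above: the frame, the function names, and the purely numeric work (no string formatting) are all simplifications chosen for illustration, so absolute timings will differ.

```python
import timeit

import pandas as pd

# Small synthetic frame; increase the range for more stable timings.
df = pd.DataFrame({'size': range(1, 201)})

def sizes_row(row):
    v = row['size']
    return v / 1024, v / 1024 ** 2, v / 1024 ** 3

def sizes_value(v):
    return v / 1024, v / 1024 ** 2, v / 1024 ** 3

def via_expand():
    out = df.copy()
    out[['kb', 'mb', 'gb']] = out.apply(sizes_row, axis=1, result_type='expand')
    return out

def via_zip_rows():
    out = df.copy()
    out['kb'], out['mb'], out['gb'] = zip(*out.apply(sizes_row, axis=1))
    return out

def via_zip_values():
    out = df.copy()
    out['kb'], out['mb'], out['gb'] = zip(*out['size'].apply(sizes_value))
    return out

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
timings = {
    fn.__name__: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for fn in (via_expand, via_zip_rows, via_zip_values)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Each function copies the frame first so that every run starts from the same state, which also means the copy cost is included in all three timings.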

3 Comments

This is really interesting... And really some food for thought. I am always partial to using a library's provided solution, but the performance difference here cannot be ignored. I wonder now how my solution would fare if only the value, and not the series, was passed. Thanks for the nice analysis!
I would have appreciated a simple topline "this method is the fastest" rather than several paragraphs of pontificating and comparison across different units (ms vs. us). Yes they are trivial to determine yourself but think of all the poor Googlers and the purpose of Stack Overflow in the first place.
I appreciate the code: it makes clear exactly what is measured, and it let me rerun the tests on my machine and try different cases. I found that the results are quite different with 1000 rows: #3 and #4 are only 50% apart, but #1 is 8x slower than #1a.
11

A fairly fast way to do this with apply and lambda. Just return the multiple values as a list and then use to_list()

    import time

    import pandas as pd

    dat = [[i, 10 * i] for i in range(100000)]

    df = pd.DataFrame(dat, columns=["a", "b"])

    def add_and_div(x):
        add = x + 3
        div = x / 3
        return [add, div]

    start = time.time()
    df[['c', 'd']] = df['a'].apply(lambda x: add_and_div(x)).to_list()
    end = time.time()

    print(end - start)  # output: 0.27606

9

Simple and easy:

    def func(item_df):
        return [1, 'Label 1'] if item_df['col_0'] > 0 else [0, 'Label 0']

    my_df[['col_1', 'col2']] = my_df.apply(func, axis=1, result_type='expand')

3

I believe the 1.1 version breaks the behavior suggested in the top answer here.

    import pandas as pd

    def test_func(row):
        row['c'] = str(row['a']) + str(row['b'])
        row['d'] = row['a'] + 1
        return row

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
    df.apply(test_func, axis=1)

When run on pandas 1.1.0, the above code returns:

       a  b   c  d
    0  1  i  1i  2
    1  1  i  1i  2
    2  1  i  1i  2

While in pandas 1.0.5 it returned:

       a  b   c  d
    0  1  i  1i  2
    1  2  j  2j  3
    2  3  k  3k  4

Which I think is what you'd expect.

I'm not sure how the release notes explain this behavior. However, as explained here, copying each row before modifying it avoids mutating the original rows and restores the old behavior, i.e.:

    def test_func(row):
        row = row.copy()  # <---- Avoid mutating the original reference
        row['c'] = str(row['a']) + str(row['b'])
        row['d'] = row['a'] + 1
        return row

2

You can go 40+ times faster than the top answers here if you do your math in NumPy instead. I adapted the top two approaches from @Rocky K's answer; the main difference is that the test runs on an actual df of 120k rows. NumPy is much faster at math when a function is applied to the whole array at once instead of value by value. The third approach is by far the fastest because it uses NumPy for the math. Also notice that it calculates 1024**2 and 1024**3 only once each instead of once per row, saving 240k calculations. Here are the timings on my machine, in seconds:

    Tuples (pass value, return tuple then zip, new columns dont exist):
    Runtime: 10.935037851333618

    Tuples (pass value, return tuple then zip, new columns exist):
    Runtime: 11.120025157928467

    Use numpy for math portions:
    Runtime: 0.24799370765686035

Here is the script I used (adapted from Rocky K) to calculate these times:

    import numpy as np
    import pandas as pd
    import locale
    import time

    size = np.random.random(120000) * 1000000000
    data = pd.DataFrame({'Size': size})

    def sizes_pass_value_return_tuple(value):
        a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c

    print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
    df1 = data.copy()
    start = time.time()
    df1['size_kb'], df1['size_mb'], df1['size_gb'] = zip(*df1['Size'].apply(sizes_pass_value_return_tuple))
    end = time.time()
    print('Runtime:', end - start, '\n')

    print('Tuples (pass value, return tuple then zip, new columns exist):')
    df2 = data.copy()
    start = time.time()
    df2 = pd.concat([df2, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    df2['size_kb'], df2['size_mb'], df2['size_gb'] = zip(*df2['Size'].apply(sizes_pass_value_return_tuple))
    end = time.time()
    print('Runtime:', end - start, '\n')

    print('Use numpy for math portions:')
    df3 = data.copy()
    start = time.time()
    df3['size_kb'] = (df3.Size.values / 1024).round(1)
    df3['size_kb'] = df3.size_kb.astype(str) + ' KB'
    df3['size_mb'] = (df3.Size.values / 1024 ** 2).round(1)
    df3['size_mb'] = df3.size_mb.astype(str) + ' MB'
    df3['size_gb'] = (df3.Size.values / 1024 ** 3).round(1)
    df3['size_gb'] = df3.size_gb.astype(str) + ' GB'
    end = time.time()
    print('Runtime:', end - start, '\n')
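One thing to be aware of: the astype(str) formatting above does not insert thousands separators, so the output differs from the question's (for example 106776.1 KB rather than 106,776.1 KB). A possible middle ground, sketched below on the question's two sample rows, keeps the division vectorized and applies only the string formatting per element. It uses Python's built-in ',' format specifier instead of locale.format_string, which is a swapped-in technique and always uses a comma regardless of locale.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

units = [('size_kb', 1, 'KB'), ('size_mb', 2, 'MB'), ('size_gb', 3, 'GB')]

for column, power, unit in units:
    # Vectorized division over the whole column; 1024 ** power is computed once.
    scaled = df_test['size'] / 1024 ** power
    # Per-element formatting with grouping, e.g. '106,776.1 KB'.
    template = '{:,.1f} ' + unit
    df_test[column] = scaled.map(template.format)

print(df_test)
```

This was not timed against the versions above, so no speed claim is made for it.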

1

Generally, to return multiple values, this is what I do

    import pandas as pd
    from numpy import array

    def gimmeMultiple(group):
        x1 = 1
        x2 = 2
        return array([[x1, x2]])

    def gimmeMultipleDf(group):
        x1 = 1
        x2 = 2
        return pd.DataFrame(array([[x1, x2]]), columns=['x1', 'x2'])

    df['size'].astype(int).apply(gimmeMultiple)
    df['size'].astype(int).apply(gimmeMultipleDf)

Returning a dataframe definitely has its perks, but it is sometimes not required. You can look at what apply() returns and play a bit with the functions ;)

3 Comments

Thanks for this sample. However, this doesn't output a single dataframe for all of the results. When I try to add it back to the original dataframe, I get "ValueError: array is not broadcastable to correct shape".
Can you provide code to produce some small data sample?
Sure thing. I just updated the code in my original post to include sample data and output.
0

This gives a new dataframe with two columns taken from the original one.

    import pandas as pd

    df = ...
    df_with_two_columns = df.apply(
        lambda row: pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),
        axis=1,
    )

0

I wanted to use apply on groupby. I tried to use what you suggested here. It definitely helped me on the way but not all the way.

Adding result_type='expand' did not work (as I use apply on Series and not DataFrame?) and with zip(*___) I lose the index.

If anyone else comes here with the same issue, here is how I (finally) solved it:

    dfg = df.groupby(by=['Column1', 'Column2']).Column3.apply(myfunc)
    dfres = pd.DataFrame()
    dfres['a'], dfres['b'], dfres['c'] = (
        dfg.apply(lambda x: x[0]),
        dfg.apply(lambda x: x[1]),
        dfg.apply(lambda x: x[2]),
    )

If you know a better way, tell me.

And do let me know if this is too much out of scope for this discussion.
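One possible alternative, offered only as a sketch: if the grouped apply yields one tuple per group, the tuples can be expanded in a single step while passing the group index along explicitly. The sample data and the body of myfunc below are invented stand-ins, since the original function is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': ['x', 'x', 'y', 'y'],
    'Column2': ['p', 'p', 'q', 'q'],
    'Column3': [1, 2, 3, 4],
})

def myfunc(group):
    # Hypothetical stand-in: returns three values per group as a tuple.
    return group.min(), group.max(), group.sum()

# One tuple per (Column1, Column2) group, indexed by the group keys.
dfg = df.groupby(by=['Column1', 'Column2'])['Column3'].apply(myfunc)

# Expand the tuples into columns and keep the group index.
dfres = pd.DataFrame(dfg.tolist(), index=dfg.index, columns=['a', 'b', 'c'])
print(dfres)
```

This avoids the three extra apply passes and keeps the MultiIndex of group keys on the result.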
