How to create a new column based on other rows in a pandas DataFrame?

Question

I have a data frame with 200k rows, and I am trying to add columns based on other rows under some conditions. My attempt works, but it takes a long time (2 hours).

Here is my code :

    for index in dataset.index:
        A_id = dataset.loc[index, 'A_id']
        B_id = dataset.loc[index, 'B_id']
        C_date = dataset.loc[index, 'C_date']
        subset = dataset[
            (dataset['A_id'] == A_id)
            & (dataset['B_id'] == B_id)
            & (dataset['C_date'] < C_date)
        ]
        dataset.at[index, 'D_mean'] = subset['D'].mean()
        dataset.at[index, 'E_mean'] = subset['E'].mean()

My data frame looks like this:

    A = [1, 2, 1, 2, 1, 2]
    B = [10, 20, 10, 20, 10, 20]
    C = ["22-02-2019", "28-02-19", "07-03-2019", "14-03-2019", "21-12-2019", "11-10-2019"]
    D = [10, 12, 21, 81, 20, 1]
    E = [7, 10, 14, 31, 61, 9]
    dataset = pd.DataFrame({
        'A_id': A,
        'B_id': B,
        'C_date': C,
        'D': D,
        'E': E,
    })
    dataset.C_date = pd.to_datetime(dataset.C_date)
    dataset
    Out[27]:
       A_id  B_id     C_date   D   E
    0     1    10 2019-02-22  10   7
    1     2    20 2019-02-28  12  10
    2     1    10 2019-07-03  21  14
    3     2    20 2019-03-14  81  31
    4     1    10 2019-12-21  20  61
    5     2    20 2019-11-10   1   9

I would like to get this result in a more efficient way than my solution:

       A_id  B_id     C_date   D   E  D_mean  E_mean
    0     1    10 2019-02-22  10   7     NaN     NaN
    1     2    20 2019-02-28  12  10     NaN     NaN
    2     1    10 2019-07-03  21  14    10.0     7.0
    3     2    20 2019-03-14  81  31    12.0    10.0
    4     1    10 2019-12-21  20  61    15.5    10.5
    5     2    20 2019-11-10   1   9    46.5    20.5

Do you have any ideas?

gold_cy · Accepted Answer · 2020-01-03 14:09:01Z

We can use a combination of functions to achieve this, most notably pd.DataFrame.rolling to calculate the moving average.

    import numpy as np

    def custom_agg(group):
        cols = ['D', 'E']
        for col in cols:
            name = '{}_mean'.format(col)
            # Shift by one row so each row only sees earlier rows,
            # then average everything seen so far in the group.
            group[name] = group[col].shift() \
                                    .rolling(len(group[col]), min_periods=2) \
                                    .mean() \
                                    .fillna(group[col].iloc[0])
            # The first row of each group has no earlier rows to average.
            group.loc[group.index[0], name] = np.nan
        return group

    dataset.groupby(['A_id', 'B_id'], as_index=False, group_keys=False).apply(custom_agg)

       A_id  B_id     C_date   D   E  D_mean  E_mean
    0     1    10 2019-02-22  10   7     NaN     NaN
    1     2    20 2019-02-28  12  10     NaN     NaN
    2     1    10 2019-07-03  21  14    10.0     7.0
    3     2    20 2019-03-14  81  31    12.0    10.0
    4     1    10 2019-12-21  20  61    15.5    10.5
    5     2    20 2019-11-10   1   9    46.5    20.5

There might be an even more elegant way of doing this; however, you should already see a performance increase using this method. Just make sure the rows are sorted by C_date ahead of time, since it is a moving average.
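The same shift-then-average idea can probably also be written with an expanding window, which would avoid the fillna workaround. The following is a minimal sketch, not the answer's original code. It assumes that no two rows in an (A_id, B_id) group share a C_date, and it writes the dates in ISO format (matching the parsed values shown in the question) to sidestep day/month ambiguity. The names keys, ordered and prior_mean are introduced here for illustration only.

```python
import pandas as pd

# Sample data from the question, with dates as they appear after parsing
dataset = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2],
    'B_id': [10, 20, 10, 20, 10, 20],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10']),
    'D': [10, 12, 21, 81, 20, 1],
    'E': [7, 10, 14, 31, 61, 9],
})

keys = ['A_id', 'B_id']

# Sort so that "previous rows" within a group means "earlier dates"
ordered = dataset.sort_values(keys + ['C_date'])

# shift() hides the current row; expanding().mean() averages all rows before it
prior_mean = (
    ordered.groupby(keys)[['D', 'E']]
    .transform(lambda s: s.shift().expanding().mean())
)

# join aligns on the index, so the original row order is preserved
dataset = dataset.join(prior_mean.add_suffix('_mean'))
print(dataset)
```

Because the result is joined back on the index, the input frame does not need to be re-sorted afterwards.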

That worked for me. Your solution is 7x faster than mine. Thank you.

Phil · Accepted Answer · 2020-01-03 14:14:57Z

I suspected that creating subset inside the loop was expensive, and my testing showed that your algorithm was running at roughly 11,000 indices per minute. I came up with an alternative algorithm that pre-sorts the data so that computing the subset becomes trivial; running it over a 200k-row dataset of random data takes under 5 minutes.

    dataset.sort_values(by=['A_id', 'B_id', 'C_date'], inplace=True)
    dataset.reset_index(drop=True, inplace=True)

    last_A = None
    last_B = None
    first_index = -1
    for index in dataset.index:
        A_id = dataset.loc[index, 'A_id']
        B_id = dataset.loc[index, 'B_id']
        C_date = dataset.loc[index, 'C_date']
        # A change in either id marks the start of a new (A_id, B_id) group
        if (last_A != A_id) | (last_B != B_id):
            first_index = index
            last_A = A_id
            last_B = B_id
        # After sorting, the earlier rows of the group are first_index .. index - 1
        subset = dataset[first_index:index]
        dataset.at[index, 'D_mean'] = subset['D'].mean()
        dataset.at[index, 'E_mean'] = subset['E'].mean()

Iterating over every row in a DataFrame is not an optimal or ideal solution.
I was unaware of the apply function. I will rework my solution to utilize that.
Simply using apply is the same as iterating over every row. The ultimate goal with pandas is to use its vectorized built-in functions.
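As one illustration of the vectorized style mentioned in the last comment, the mean of all earlier rows in a group can be derived from running totals: the sum of the earlier values is the group's cumulative sum minus the current value, and the number of earlier rows is the group's cumulative count. The sketch below is not taken from any of the answers. It assumes dates are unique within each (A_id, B_id) group, and the names keys, cols, ordered, prior_count and prior_sum are made up for this example.

```python
import pandas as pd

# Sample data from the question, with dates as they appear after parsing
dataset = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2],
    'B_id': [10, 20, 10, 20, 10, 20],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10']),
    'D': [10, 12, 21, 81, 20, 1],
    'E': [7, 10, 14, 31, 61, 9],
})

keys = ['A_id', 'B_id']
cols = ['D', 'E']

ordered = dataset.sort_values(keys + ['C_date'])
grouped = ordered.groupby(keys)

# Number of rows that come before each row inside its group (0 for the first)
prior_count = grouped.cumcount()
# Running total per group, minus the row's own value: sum of earlier rows only
prior_sum = grouped[cols].cumsum() - ordered[cols]

# Replace a count of 0 with NaN so the first row of each group yields NaN
prior_mean = prior_sum.div(prior_count.where(prior_count > 0), axis=0)

# join aligns on the index, so the original row order is preserved
dataset = dataset.join(prior_mean.add_suffix('_mean'))
print(dataset)
```

No Python-level loop or per-row lambda is involved here, so the cost is dominated by one sort and a few grouped cumulative operations.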

YOLO · Accepted Answer · 2020-01-03 14:22:05Z

Here's one way to do it using .apply:

    dataset[['D_mean', 'E_mean']] = (dataset
        .apply(lambda row: dataset[(dataset['A_id'] == row['A_id']) &
                                   (dataset['B_id'] == row['B_id']) &
                                   (dataset['C_date'] < row['C_date'])
                                  ][['D', 'E']].mean(axis=0), axis=1))

       A_id  B_id     C_date   D   E  D_mean  E_mean
    0     1    10 2019-02-22  10   7     NaN     NaN
    1     2    20 2019-02-28  12  10     NaN     NaN
    2     1    10 2019-07-03  21  14    10.0     7.0
    3     2    20 2019-03-14  81  31    12.0    10.0
    4     1    10 2019-12-21  20  61    15.5    10.5
    5     2    20 2019-11-10   1   9    46.5    20.5
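One detail of the strict `<` comparison is worth spelling out: rows in the same group that share a C_date are excluded from each other's average. A sorted, cumulative approach only reproduces that if it first aggregates per date. The sketch below shows one possible way to do so; it is an assumption-laden illustration rather than part of this answer. The last row of the sample frame is a hypothetical addition (not in the question) that duplicates an existing date, and the names df, keys, daily, running, prior, means and result are invented for the example.

```python
import pandas as pd

# Question's sample data plus ONE HYPOTHETICAL extra row (the last one),
# which shares group (1, 10) and date 2019-07-03 with an existing row.
df = pd.DataFrame({
    'A_id': [1, 2, 1, 2, 1, 2, 1],
    'B_id': [10, 20, 10, 20, 10, 20, 10],
    'C_date': pd.to_datetime(['2019-02-22', '2019-02-28', '2019-07-03',
                              '2019-03-14', '2019-12-21', '2019-11-10',
                              '2019-07-03']),
    'D': [10, 12, 21, 81, 20, 1, 5],
    'E': [7, 10, 14, 31, 61, 9, 3],
})

keys = ['A_id', 'B_id']

# Step 1: one row per (group, date) holding that date's totals and row count.
# groupby sorts by its keys, so dates are ascending within each group.
daily = df.groupby(keys + ['C_date'], as_index=False).agg(
    D_sum=('D', 'sum'), E_sum=('E', 'sum'), n=('D', 'count'))

# Step 2: running totals per group minus the current date's own share
# leave only the contribution of strictly earlier dates.
running = daily.groupby(keys)[['D_sum', 'E_sum', 'n']].cumsum()
prior = running - daily[['D_sum', 'E_sum', 'n']]
prior_n = prior['n'].where(prior['n'] > 0)  # NaN where nothing is earlier

means = daily[keys + ['C_date']].copy()
means['D_mean'] = prior['D_sum'] / prior_n
means['E_mean'] = prior['E_sum'] / prior_n

# Step 3: attach the per-date means back to every original row.
result = df.merge(means, on=keys + ['C_date'], how='left')
print(result)
```

In this sample, the two rows dated 2019-07-03 both get the mean of the single earlier row, and the 2019-12-21 row averages over all three earlier rows, which is what the row-wise filter would compute as well.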
