
Starting from a dataframe, I want to build a new dataframe that adds a column each time an index value appears again, without knowing in advance how many columns I will need:

pd.DataFrame([["John", "guitar"],
              ["Michael", "football"],
              ["Andrew", "running"],
              ["John", "dancing"],
              ["Andrew", "cars"]])

and I want :

pd.DataFrame([["John", "guitar", "dancing"],
              ["Michael", "football", None],
              ["Andrew", "running", "cars"]])

without knowing how many columns I should create at the start.

Comments:
  • @FFL75 updated with a faster alternative, better suited to large dataframes. Commented Dec 13, 2018 at 14:37
  • Actually, in my real use case values repeat a great deal; it's true that this example doesn't make that clear, since it only shows unique values :) Commented Dec 13, 2018 at 14:44

3 Answers

df = pd.DataFrame([["John", "guitar"],
                   ["Michael", "football"],
                   ["Andrew", "running"],
                   ["John", "dancing"],
                   ["Andrew", "cars"]],
                  columns=['person', 'hobby'])

You can group by person and take the unique values of hobby. Then use .apply(pd.Series) to expand the resulting lists into columns:

df.groupby('person').hobby.unique().apply(pd.Series).reset_index()

    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN

For a large dataframe, try the more efficient alternative:

df = df.groupby('person').hobby.unique()
df = pd.DataFrame(df.values.tolist(), index=df.index).reset_index()

This does essentially the same thing, but avoids the row-by-row overhead of applying pd.Series.
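Both variants of this answer can be run side by side; a minimal runnable sketch, using the sample data from the question with the column names person/hobby assumed above:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [["John", "guitar"], ["Michael", "football"], ["Andrew", "running"],
     ["John", "dancing"], ["Andrew", "cars"]],
    columns=["person", "hobby"],
)

# Variant 1: expand the unique-hobby arrays with apply(pd.Series)
slow = df.groupby("person").hobby.unique().apply(pd.Series).reset_index()

# Variant 2: build the frame directly from the lists of values,
# avoiding the per-row apply
s = df.groupby("person").hobby.unique()
fast = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()

print(fast)
```

Both frames have one row per person and as many value columns as the longest group, with missing entries left as NaN/None.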


2 Comments

What is the problem with your answer: why unique? And why pd.Series, which is really slow?
Yes, I agree that .apply(pd.Series) isn't the best choice for very large dataframes, but it will do the job otherwise. As for "why unique": I assume the OP wants a record of which hobbies are present in the dataframe for each person. Otherwise please let me know @ffl75

Use GroupBy.cumcount to get a per-group counter, then reshape with unstack:

df1 = pd.DataFrame([["John", "guitar"],
                    ["Michael", "football"],
                    ["Andrew", "running"],
                    ["John", "dancing"],
                    ["Andrew", "cars"]],
                   columns=['a', 'b'])

print(df1)
         a         b
0     John    guitar
1  Michael  football
2   Andrew   running
3     John   dancing
4   Andrew      cars

df = (df1.set_index(['a', df1.groupby('a').cumcount()])['b']
         .unstack()
         .rename_axis(-1)
         .reset_index()
         .rename(columns=lambda x: x + 1))
print(df)
         0         1        2
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN

Or aggregate lists and create a new DataFrame with the constructor:

s = df1.groupby('a')['b'].agg(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()
print(df)
         a         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None
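The two variants in this answer build the same wide frame; a minimal runnable check, with the a/b column names assumed from the answer and the unstack variant stripped of the purely cosmetic rename steps:

```python
import pandas as pd

df1 = pd.DataFrame(
    [["John", "guitar"], ["Michael", "football"], ["Andrew", "running"],
     ["John", "dancing"], ["Andrew", "cars"]],
    columns=["a", "b"],
)

# Variant 1: cumcount as a second index level, then unstack to columns
wide1 = (df1.set_index(["a", df1.groupby("a").cumcount()])["b"]
            .unstack()
            .reset_index())

# Variant 2: agg(list) plus the DataFrame constructor
s = df1.groupby("a")["b"].agg(list)
wide2 = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()

# Normalize the NaN/None difference before comparing the frames
print(wide1.fillna("-").equals(wide2.fillna("-")))
```

The only difference is the missing-value marker (NaN from unstack versus None from the constructor), which the fillna call papers over for the comparison.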

1 Comment

@RavinderSingh13 - Thank you.

Assuming the column names are ['person', 'activity'], you can do

df_out = df.groupby('person').agg(list).reset_index()
df_out = pd.concat([df_out, pd.DataFrame(df_out['activity'].values.tolist())], axis=1)
df_out = df_out.drop(columns='activity')

giving you

    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None

