
Starting from a dataframe, I want to build a new dataframe that adds a column each time an index value appears again, without knowing in advance how many columns I will need:

pd.DataFrame([["John", "guitar"],
              ["Michael", "football"],
              ["Andrew", "running"],
              ["John", "dancing"],
              ["Andrew", "cars"]])

and I want :

pd.DataFrame([["John", "guitar", "dancing"],
              ["Michael", "football", None],
              ["Andrew", "running", "cars"]])

without knowing how many columns I should create at the start.

Comments:
  • @FFL75 updated with a faster alternative, better suited to large dataframes. Commented Dec 13, 2018 at 14:37
  • Actually, in my real use case values repeat a great deal; it's true that this example doesn't make that clear, since it only shows unique values :) Commented Dec 13, 2018 at 14:44

3 Answers

df = pd.DataFrame([["John", "guitar"],
                   ["Michael", "football"],
                   ["Andrew", "running"],
                   ["John", "dancing"],
                   ["Andrew", "cars"]],
                  columns=['person', 'hobby'])

You can group by person and take the unique values of hobby. Then use .apply(pd.Series) to expand the resulting lists into columns:

df.groupby('person').hobby.unique().apply(pd.Series).reset_index()

    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN

For a large dataframe, try the more efficient alternative:

df = df.groupby('person').hobby.unique()
df = pd.DataFrame(df.values.tolist(), index=df.index).reset_index()

This does essentially the same thing, but avoids the row-by-row overhead of applying pd.Series.
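Both variants of this answer can be run side by side; a minimal runnable sketch, using the sample data from the question with the column names person/hobby assumed above:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [["John", "guitar"], ["Michael", "football"], ["Andrew", "running"],
     ["John", "dancing"], ["Andrew", "cars"]],
    columns=["person", "hobby"],
)

# Variant 1: expand the unique-hobby arrays with apply(pd.Series)
slow = df.groupby("person").hobby.unique().apply(pd.Series).reset_index()

# Variant 2: build the frame directly from the lists of values,
# avoiding the per-row apply
s = df.groupby("person").hobby.unique()
fast = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()

print(fast)
```

Both frames have one row per person and as many value columns as the longest group, with missing entries left as NaN/None.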


2 Comments

What is the problem with your answer: why unique? And why pd.Series, which is really slow?
Yes, I agree that .apply(pd.Series) isn't the best choice for very large dataframes, but it will do the job otherwise. As for "why unique": I assume the OP wants a record of which hobbies are present in the dataframe for each person. Otherwise please let me know @ffl75

Use GroupBy.cumcount to get a per-group counter, then reshape with unstack:

df1 = pd.DataFrame([["John", "guitar"],
                    ["Michael", "football"],
                    ["Andrew", "running"],
                    ["John", "dancing"],
                    ["Andrew", "cars"]],
                   columns=['a', 'b'])

print(df1)
         a         b
0     John    guitar
1  Michael  football
2   Andrew   running
3     John   dancing
4   Andrew      cars

df = (df1.set_index(['a', df1.groupby('a').cumcount()])['b']
         .unstack()
         .rename_axis(-1)
         .reset_index()
         .rename(columns=lambda x: x + 1))
print(df)
         0         1        2
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN

Or aggregate lists and create a new DataFrame with the constructor:

s = df1.groupby('a')['b'].agg(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()
print(df)
         a         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None
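The two variants in this answer build the same wide frame; a minimal runnable check, with the a/b column names assumed from the answer and the unstack variant stripped of the purely cosmetic rename steps:

```python
import pandas as pd

df1 = pd.DataFrame(
    [["John", "guitar"], ["Michael", "football"], ["Andrew", "running"],
     ["John", "dancing"], ["Andrew", "cars"]],
    columns=["a", "b"],
)

# Variant 1: cumcount as a second index level, then unstack to columns
wide1 = (df1.set_index(["a", df1.groupby("a").cumcount()])["b"]
            .unstack()
            .reset_index())

# Variant 2: agg(list) plus the DataFrame constructor
s = df1.groupby("a")["b"].agg(list)
wide2 = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()

# Normalize the NaN/None difference before comparing the frames
print(wide1.fillna("-").equals(wide2.fillna("-")))
```

The only difference is the missing-value marker (NaN from unstack versus None from the constructor), which the fillna call papers over for the comparison.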

1 Comment

@RavinderSingh13 - Thank you.

Assuming the column names are ['person', 'activity'], you can do

df_out = df.groupby('person').agg(list).reset_index()
df_out = pd.concat([df_out, pd.DataFrame(df_out['activity'].values.tolist())], axis=1)
df_out = df_out.drop(columns='activity')

giving you

    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None

