How can I create new columns with groupby in pandas?

Question

I have a pandas DataFrame like this:

>>> import pandas as pd
>>> data = {
...     'hotel_code': [1, 1, 1, 1, 1],
...     'feed': [1, 1, 1, 1, 2],
...     'price_euro': [100, 200, 250, 120, 130],
...     'client_nationality': ['fr', 'us', 'ru,de', 'gb', 'cn,us,br,il,fr,gb,de,ie,pk,pl']
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed  price_euro             client_nationality
0           1     1         100                             fr
1           1     1         200                             us
2           1     1         250                          ru,de
3           1     1         120                             gb
4           1     2         130  cn,us,br,il,fr,gb,de,ie,pk,pl

And here is the expected output:

>>> import numpy as np
>>> data = {
...     'hotel_code': [1, 1],
...     'feed': [1, 2],
...     'cluster1': ['fr', 'cn,us,br,il,fr,gb,de,ie,pk,pl'],
...     'cluster2': ['us', np.nan],
...     'cluster3': ['ru,de', np.nan],
...     'cluster4': ['gb', np.nan],
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed                       cluster1 cluster2 cluster3 cluster4
0           1     1                             fr       us    ru,de       gb
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl      NaN      NaN      NaN

I want to create cluster columns for each unique combination of hotel_code and feed, but I have no idea how. The number of clusters can vary. Any ideas? Thanks in advance.

For instance, if there were also a row with client_nationality ru for hotel_code=1 and feed=2, then ru would go in cluster2 for that output row.
– E. Zeytinci, Commented Jan 8, 2020 at 13:30

jezrael · Accepted Answer · 2020-01-08 13:42:34Z

Use GroupBy.cumcount to number the rows within each group, build a MultiIndex from hotel_code, feed and that counter Series, and reshape with Series.unstack. Finally, rename the columns and call DataFrame.reset_index to turn the MultiIndex levels back into columns:

g = df.groupby(["hotel_code", "feed"]).cumcount()
df1 = (df.set_index(["hotel_code", "feed", g])['client_nationality']
         .unstack()
         .rename(columns=lambda x: f'cluster_{x+1}')
         .reset_index())
print(df1)
   hotel_code  feed                      cluster_1 cluster_2 cluster_3  \
0           1     1                             fr        us     ru,de
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl       NaN       NaN

  cluster_4
0        gb
1       NaN
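To see why the reshape works, it can help to look at the intermediate counter. The following is a minimal sketch, assuming the df from the question; the names counter, wide, extra, df2, counter2 and wide2 are introduced here for illustration only, and the extra row (including its price_euro value) is hypothetical, added to mimic the case described in the question's comment.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# Position of each row within its (hotel_code, feed) group, starting at 0.
counter = df.groupby(["hotel_code", "feed"]).cumcount()
print(counter.tolist())  # [0, 1, 2, 3, 0]

# The counter becomes the innermost index level; unstack() pivots it to columns.
wide = df.set_index(["hotel_code", "feed", counter])["client_nationality"].unstack()
print(wide.columns.tolist())  # [0, 1, 2, 3]

# Hypothetical extra row for hotel_code=1, feed=2 (price_euro is a placeholder).
extra = pd.DataFrame({
    "hotel_code": [1],
    "feed": [2],
    "price_euro": [0],
    "client_nationality": ["ru"],
})
df2 = pd.concat([df, extra], ignore_index=True)

counter2 = df2.groupby(["hotel_code", "feed"]).cumcount()
wide2 = df2.set_index(["hotel_code", "feed", counter2])["client_nationality"].unstack()

# The new value is the second entry of the (1, 2) row, i.e. the second cluster.
print(wide2.loc[(1, 2)].tolist()[:2])
```

Because the counter restarts at 0 for every group, the number of cluster columns is simply the size of the largest group, and shorter groups are padded with NaN.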

All the answers give me the result I expected, but this is the most efficient way. Thanks.

ignoring_gravity · Answer · 2020-01-08 13:34:41Z

You could create a new dataframe with your clusters:

clusters = pd.DataFrame(
    df.groupby(["hotel_code", "feed"])
    .agg(list)
    .reset_index()
    .client_nationality.tolist()
)
clusters.columns = [f"cluster_{i}" for i in range(1, clusters.shape[1] + 1)]

Then,

pd.concat(
    [
        df.drop(["price_euro", "client_nationality"], axis=1)
        .drop_duplicates(["hotel_code", "feed"])
        .reset_index(drop=True),
        clusters,
    ],
    axis=1,
)

would return

   hotel_code  feed                      cluster_1 cluster_2 cluster_3 cluster_4
0           1     1                             fr        us     ru,de        gb
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl      None      None      None
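The missing entries in the second row come from the DataFrame constructor, which pads shorter rows when it is given lists of unequal length. Here is a small sketch of that step in isolation; the literal rows list is a stand-in for what .agg(list) produces for the two groups, and the exact missing-value marker printed (None or NaN) may depend on the pandas version.

```python
import pandas as pd

# Lists of unequal length, one per (hotel_code, feed) group.
rows = [
    ["fr", "us", "ru,de", "gb"],
    ["cn,us,br,il,fr,gb,de,ie,pk,pl"],
]

# The constructor makes as many columns as the longest list and pads the rest.
clusters = pd.DataFrame(rows)
clusters.columns = [f"cluster_{i}" for i in range(1, clusters.shape[1] + 1)]

print(clusters)
print(clusters.isna().sum().sum())  # number of padded cells
```

Since the padding happens automatically, the number of cluster columns does not need to be known in advance.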

Vishnudev Krishnadas · Answer · 2020-01-09 09:59:59Z

Group by hotel_code and feed, join the client_nationality values of each group with a space, then split on that space with expand=True.

Finally, rename the resulting columns to cluster_1, cluster_2, and so on.

# Parentheses let the method chain span several lines.
(df.groupby(['hotel_code', 'feed'])['client_nationality']
   .agg(' '.join)
   .str.split(' ', expand=True)
   .rename(columns=lambda x: f'cluster_{x+1}'))

Output

                                     cluster_1 cluster_2 cluster_3 cluster_4
hotel_code feed
1          1                                fr        us     ru,de        gb
           2     cn,us,br,il,fr,gb,de,ie,pk,pl      None      None      None
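The join-then-split approach is easier to follow with the intermediate Series in view. The sketch below assumes the df from the question; the names joined and result are illustrative, and the trailing reset_index() is an optional addition (not part of the answer above) that turns hotel_code and feed back into ordinary columns, as in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# One space-separated string per (hotel_code, feed) group.
joined = df.groupby(["hotel_code", "feed"])["client_nationality"].agg(" ".join)
print(joined.tolist())
# ['fr us ru,de gb', 'cn,us,br,il,fr,gb,de,ie,pk,pl']

# Split each string back into columns; shorter rows are padded.
result = (
    joined.str.split(" ", expand=True)
    .rename(columns=lambda x: f"cluster_{x + 1}")
    .reset_index()  # optional: index levels back to columns
)
print(result.columns.tolist())
```

Note that this relies on the separator (a space here) never occurring inside a client_nationality value; otherwise a value would be split across two cluster columns.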
