How to shift and stack two columns of pandas Dataframe into one column?

Question

I have the DataFrame shown below:

                       ETHNIC            SEX  USUBJID
    0      HISPANIC OR LATINO              F       16
    1      HISPANIC OR LATINO              M        8
    2      HISPANIC OR LATINO  Total__##!!??       24
    3  NOT HISPANIC OR LATINO              F       25
    4  NOT HISPANIC OR LATINO              M       18
    5  NOT HISPANIC OR LATINO  Total__##!!??       43
    6           Total__##!!??              F       41
    7           Total__##!!??              M       26
    8           Total__##!!??  Total__##!!??       67

To load it, copy the DataFrame above to the clipboard and run df = pd.read_clipboard(sep=r'\s\s+').
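If the clipboard route is inconvenient, the same frame can also be built directly in code. This is only a minimal sketch: the values are copied from the table above, and the name `rows` is illustrative rather than part of the original post.

```python
import pandas as pd

# Values copied from the table in the question; `rows` is an illustrative name.
rows = [
    ("HISPANIC OR LATINO", "F", 16),
    ("HISPANIC OR LATINO", "M", 8),
    ("HISPANIC OR LATINO", "Total__##!!??", 24),
    ("NOT HISPANIC OR LATINO", "F", 25),
    ("NOT HISPANIC OR LATINO", "M", 18),
    ("NOT HISPANIC OR LATINO", "Total__##!!??", 43),
    ("Total__##!!??", "F", 41),
    ("Total__##!!??", "M", 26),
    ("Total__##!!??", "Total__##!!??", 67),
]
df = pd.DataFrame(rows, columns=["ETHNIC", "SEX", "USUBJID"])
print(df)
```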

I'm trying to transform it into the following DataFrame:

                      stacked  USUBJID
    0      HISPANIC OR LATINO      NaN   <-----
    0                       F       16
    1                       M        8
    2           Total__##!!??       24
    0  NOT HISPANIC OR LATINO      NaN   <-----
    3                       F       25
    4                       M       18
    5           Total__##!!??       43
    0           Total__##!!??      NaN   <-----
    6                       F       41
    7                       M       26
    8           Total__##!!??       67

I want to stack the ETHNIC and SEX columns together, with the SEX values placed under the corresponding ETHNIC value, for each unique value in the ETHNIC column.

I tried something like the code below. It works, but I don't think it is a robust solution. I split the DataFrame into n slices (where n is the number of unique values in the ETHNIC column), put them in a list with an empty row in front of each slice, then concatenate the list and derive the stacked column from the result.

    cols = ['ETHNIC', 'SEX']
    results = []
    for v in df[cols[0]].unique():
        results.append(pd.DataFrame([[None]*df.shape[1]], columns=df.columns))
        results.append(df[df[cols[0]].eq(v)])
    results = pd.concat(results)
    results[cols[0]] = results[cols[0]].bfill()
    results['stacked'] = results.apply(lambda x: x['SEX'] if x['SEX'] else x['ETHNIC'], axis=1)
    results = results.drop(columns=cols)[['stacked', 'USUBJID']]

Kindly post the source code: df.to_dict('records'). I am having difficulty copying the shared data.
– sammywemmy, Commented Jul 21, 2021 at 8:35
@sammywemmy, just copy the dataframe and try df = pd.read_clipboard('\s\s+'). I tested it and it works; if it still doesn't work for you, let me know and I'll add it as a dict.
– ThePyGuy, Commented Jul 21, 2021 at 8:37

Sunderam Dubey · Accepted Answer · 2022-05-30 15:42:54Z

Start by grouping on "ETHNIC" with pandas.DataFrame.groupby.

Each group is itself a DataFrame. Keep only its ['SEX', 'USUBJID'] columns and rename "SEX" to "stacked" using pandas.DataFrame.rename.

The header row is built from the group name, d.name, and concatenated ahead of the group's rows using pandas.concat.

Finally, the first level of the MultiIndex that results from the operation is dropped with pandas.DataFrame.reset_index.

    import numpy as np
    import pandas as pd

    (df.groupby('ETHNIC')
       .apply(lambda d: pd.concat([
           # header row: the group name with no USUBJID value
           pd.DataFrame([{'stacked': d.name, 'USUBJID': np.nan}]),
           d[['SEX', 'USUBJID']].rename(columns={'SEX': 'stacked'}),
       ]))
       .reset_index(level=0, drop=True)
    )

output:

                      stacked  USUBJID
    0      HISPANIC OR LATINO      NaN
    0                       F     16.0
    1                       M      8.0
    2           Total__##!!??     24.0
    0  NOT HISPANIC OR LATINO      NaN
    3                       F     25.0
    4                       M     18.0
    5           Total__##!!??     43.0
    0           Total__##!!??      NaN
    6                       F     41.0
    7                       M     26.0
    8           Total__##!!??     67.0

NB: depending on exactly which index you want, you can adjust the reset_index call.
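As a rough illustration of the index choices, the sketch below builds the stacked frame for a two-group stand-in and compares keeping the original row labels with discarding them. It swaps in a plain loop over the groups instead of groupby().apply() so that it runs on its own; the names `pieces`, `stacked` and `clean` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Two-group stand-in for the question's data so the sketch is self-contained.
df = pd.DataFrame({
    "ETHNIC": ["HISPANIC OR LATINO"] * 2 + ["NOT HISPANIC OR LATINO"] * 2,
    "SEX": ["F", "M", "F", "M"],
    "USUBJID": [16, 8, 25, 18],
})

# One header row followed by the group's own rows, per ETHNIC value.
pieces = []
for name, g in df.groupby("ETHNIC"):
    header = pd.DataFrame([{"stacked": name, "USUBJID": np.nan}])
    body = g[["SEX", "USUBJID"]].rename(columns={"SEX": "stacked"})
    pieces.append(pd.concat([header, body]))
stacked = pd.concat(pieces)

# Option 1: keep the original row labels (every header row is labelled 0).
print(stacked.index.tolist())   # [0, 0, 1, 0, 2, 3]

# Option 2: discard them for a clean 0..n-1 index.
clean = stacked.reset_index(drop=True)
print(clean.index.tolist())     # [0, 1, 2, 3, 4, 5]
```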
@ThePyGuy: it was ongoing, this is usually the part that takes most time ;)

Shubham Sharma · Accepted Answer · 2021-07-21 08:54:48Z

Let us try with reshape

    from collections import defaultdict

    def reshape():
        data = defaultdict(list)
        for k, g in df.groupby('ETHNIC'):
            data['stacked'] += [k, *g['SEX']]
            data['USUBJID'] += [np.nan, *g['USUBJID']]
        return data

    pd.DataFrame(reshape())

                       stacked  USUBJID
    0       HISPANIC OR LATINO      NaN
    1                        F     16.0
    2                        M      8.0
    3            Total__##!!??     24.0
    4   NOT HISPANIC OR LATINO      NaN
    5                        F     25.0
    6                        M     18.0
    7            Total__##!!??     43.0
    8            Total__##!!??      NaN
    9                        F     41.0
    10                       M     26.0
    11           Total__##!!??     67.0

mozway · Accepted Answer · 2021-07-21 09:21:52Z

Primarily for fun, here is another option based on @Shubham Sharma's answer that doesn't require defaultdict. Even the dependency on numpy can be removed (see the alternative at the end).

It only uses the pandas.DataFrame constructor and pandas.concat.

    import numpy as np

    pd.concat([pd.DataFrame({'stacked': np.append(k, g['SEX']),
                             'USUBJID': np.append(np.nan, g['USUBJID']),
                             })
               for k, g in df.groupby('ETHNIC')
               ])

output:

                      stacked  USUBJID
    0      HISPANIC OR LATINO      NaN
    1                       F     16.0
    2                       M      8.0
    3           Total__##!!??     24.0
    0  NOT HISPANIC OR LATINO      NaN
    1                       F     25.0
    2                       M     18.0
    3           Total__##!!??     43.0
    0           Total__##!!??      NaN
    1                       F     41.0
    2                       M     26.0
    3           Total__##!!??     67.0

alternative without numpy:

    pd.concat([pd.DataFrame({'stacked': [k] + g['SEX'].to_list(),
                             'USUBJID': [None] + g['USUBJID'].to_list(),
                             })
               for k, g in df.groupby('ETHNIC')
               ])
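One thing worth keeping in mind with any of the groupby-based variants: groupby sorts the group keys by default, which happens to match the row order of the question's data. If the groups should instead appear in order of first appearance, passing sort=False is one way to get that. The sketch below wraps the numpy-free variant in a helper; the function name `stack_under_header` and its parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd

def stack_under_header(df, header_col, label_col, value_col, out_col="stacked"):
    """Hypothetical helper: place each header_col value above its group's label_col values."""
    pieces = [
        pd.DataFrame({
            out_col: [key] + g[label_col].to_list(),
            value_col: [None] + g[value_col].to_list(),  # None becomes NaN in a numeric column
        })
        # sort=False keeps the groups in order of first appearance
        for key, g in df.groupby(header_col, sort=False)
    ]
    return pd.concat(pieces, ignore_index=True)

# Stand-in data where the groups are deliberately NOT in alphabetical order.
df = pd.DataFrame({
    "ETHNIC": ["Total", "Total", "HISPANIC OR LATINO", "HISPANIC OR LATINO"],
    "SEX": ["F", "M", "F", "M"],
    "USUBJID": [41, 26, 16, 8],
})
out = stack_under_header(df, "ETHNIC", "SEX", "USUBJID")
print(out["stacked"].to_list())
# ['Total', 'F', 'M', 'HISPANIC OR LATINO', 'F', 'M']
```

With the default sort=True, the "HISPANIC OR LATINO" block would come first in this stand-in data.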

I agree, I just didn't think of it at that time. It is also faster ;)

Rob Raymond · Accepted Answer · 2021-07-21 09:12:21Z

You can use groupby().apply()

    import io

    import numpy as np
    import pandas as pd

    df = pd.read_csv(io.StringIO("""                   ETHNIC            SEX  USUBJID
    0      HISPANIC OR LATINO              F       16
    1      HISPANIC OR LATINO              M        8
    2      HISPANIC OR LATINO  Total__##!!??       24
    3  NOT HISPANIC OR LATINO              F       25
    4  NOT HISPANIC OR LATINO              M       18
    5  NOT HISPANIC OR LATINO  Total__##!!??       43
    6           Total__##!!??              F       41
    7           Total__##!!??              M       26
    8           Total__##!!??  Total__##!!??       67"""), sep=r"\s\s+", engine="python")

    df.groupby("ETHNIC", as_index=False).apply(
        lambda d: pd.concat(
            [
                # header row: first row of the group with USUBJID blanked out
                d.iloc[0,].to_frame().T.assign(USUBJID=np.nan),
                # remaining rows: SEX values moved into the ETHNIC column
                d.assign(ETHNIC=d.SEX),
            ]
        ).drop(columns="SEX")
    ).reset_index(drop=True)

	ETHNIC	USUBJID
0	HISPANIC OR LATINO	nan
1	F	16
2	M	8
3	Total__##!!??	24
4	NOT HISPANIC OR LATINO	nan
5	F	25
6	M	18
7	Total__##!!??	43
8	Total__##!!??	nan
9	F	41
10	M	26
11	Total__##!!??	67

sammywemmy · Accepted Answer · 2021-07-21 09:27:33Z

    # use `total` as a counter
    (df.assign(total=lambda df: pd.Series(np.where(df.SEX.str.startswith("Total"),
                                                    df.index,
                                                    np.nan)).bfill())
       .melt(['USUBJID', 'total'], ignore_index=False)
       .sort_index()
       .assign(temp=lambda df: df.variable.str.startswith("ETH").groupby(df.total).cumsum(),
               USUBJID=lambda df: np.where(df.variable.str.startswith("ETH"), np.nan, df.USUBJID))
       # keep only first row for `ETHNIC`
       .query("variable == 'ETHNIC' and temp == 1 or variable=='SEX' and temp >= 1")
       .drop(columns=['variable', 'total', 'temp'])
    )

       USUBJID                   value
    0      NaN      HISPANIC OR LATINO
    0     16.0                       F
    1      8.0                       M
    2     24.0           Total__##!!??
    3      NaN  NOT HISPANIC OR LATINO
    3     25.0                       F
    4     18.0                       M
    5     43.0           Total__##!!??
    6      NaN           Total__##!!??
    6     41.0                       F
    7     26.0                       M
    8     67.0           Total__##!!??

Personally, I find the other answers simpler and easier to grok.
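To make the melt-based chain a little easier to follow, here is a small sketch of just its first step on a two-row stand-in: melting with ignore_index=False keeps the original row labels, and sorting on them interleaves each ETHNIC value with its SEX value. The kind="stable" argument is an addition of this sketch (not in the answer above), used so that the ETHNIC row reliably stays ahead of the SEX row for the same label; the name `long_df` is illustrative.

```python
import pandas as pd

# Two-row stand-in so the intermediate result is easy to read.
df = pd.DataFrame({
    "ETHNIC": ["HISPANIC OR LATINO", "HISPANIC OR LATINO"],
    "SEX": ["F", "M"],
    "USUBJID": [16, 8],
})

long_df = (
    df.melt(["USUBJID"], ignore_index=False)  # keep the original row labels
      .sort_index(kind="stable")              # stable sort preserves ETHNIC-before-SEX order
)
print(long_df)
```

Each original label now appears twice, once with variable == 'ETHNIC' and once with variable == 'SEX'; the rest of the chain filters out the repeated ETHNIC rows.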
