I have the following dataframe:

                   ETHNIC            SEX  USUBJID
0      HISPANIC OR LATINO              F       16
1      HISPANIC OR LATINO              M        8
2      HISPANIC OR LATINO  Total__##!!??       24
3  NOT HISPANIC OR LATINO              F       25
4  NOT HISPANIC OR LATINO              M       18
5  NOT HISPANIC OR LATINO  Total__##!!??       43
6           Total__##!!??              F       41
7           Total__##!!??              M       26
8           Total__##!!??  Total__##!!??       67

To load it, copy the dataframe above to the clipboard and run df = pd.read_clipboard(sep=r'\s\s+').
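For anyone who cannot use the clipboard, the same frame can also be built directly. This is a minimal sketch; the values are copied from the table above and the construction itself is only one of several possible ways to write it:

```python
import pandas as pd

# Values copied from the table in the question; each ETHNIC value has
# three rows (F, M and the total row).
df = pd.DataFrame({
    "ETHNIC": ["HISPANIC OR LATINO"] * 3
              + ["NOT HISPANIC OR LATINO"] * 3
              + ["Total__##!!??"] * 3,
    "SEX": ["F", "M", "Total__##!!??"] * 3,
    "USUBJID": [16, 8, 24, 25, 18, 43, 41, 26, 67],
})

print(df)
```
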

I'm trying to transform it into the following dataframe:

                  stacked  USUBJID
0      HISPANIC OR LATINO      NaN  <-----
0                       F       16
1                       M        8
2           Total__##!!??       24
0  NOT HISPANIC OR LATINO      NaN  <-----
3                       F       25
4                       M       18
5           Total__##!!??       43
0           Total__##!!??      NaN  <-----
6                       F       41
7                       M       26
8           Total__##!!??       67

I want to stack the ETHNIC and SEX columns into a single column: for each unique value of ETHNIC, that value should appear as a header row, followed by the SEX values of that group.

I tried something like the following, which works, but I don't think it is a robust solution. I split the dataframe into n slices (where n is the number of unique values in the ETHNIC column), put an empty row in front of each slice, concatenate everything, and then do the rest of the work on the result.

cols = ['ETHNIC', 'SEX']
results = []
for v in df[cols[0]].unique():
    results.append(pd.DataFrame([[None]*df.shape[1]], columns=df.columns))
    results.append(df[df[cols[0]].eq(v)])
results = pd.concat(results)
results[cols[0]] = results[cols[0]].bfill()
results['stacked'] = results.apply(lambda x: x['SEX'] if x['SEX'] else x['ETHNIC'], axis=1)
results = results.drop(columns=cols)[['stacked', 'USUBJID']]
  • Kindly post the source data as df.to_dict('records'); I am having difficulty copying the shared data. Commented Jul 21, 2021 at 8:35
  • @sammywemmy, just copy the dataframe and try df = pd.read_clipboard('\s\s+'). I tested it and it works; if it still doesn't work for you, let me know and I'll add it as a dict. Commented Jul 21, 2021 at 8:37

5 Answers

Start by grouping on "ETHNIC" with pandas.DataFrame.groupby.

Each group is a DataFrame, from which only the ['SEX', 'USUBJID'] columns are kept, with "SEX" renamed to "stacked" using pandas.DataFrame.rename.

The header row is built from the group name d.name and placed on top of the group DataFrame using pandas.concat.

Finally, the first level of the MultiIndex that results from the operation is dropped with pandas.DataFrame.reset_index.

import numpy as np

(df.groupby('ETHNIC')
   # header row (group name, NaN) on top of the renamed group rows
   .apply(lambda d: pd.concat([pd.DataFrame([{'stacked': d.name, 'USUBJID': np.nan}]),
                               d[['SEX', 'USUBJID']].rename(columns={'SEX': 'stacked'})
                              ]))
   .reset_index(level=0, drop=True)
)

output:

                  stacked  USUBJID
0      HISPANIC OR LATINO      NaN
0                       F     16.0
1                       M      8.0
2           Total__##!!??     24.0
0  NOT HISPANIC OR LATINO      NaN
3                       F     25.0
4                       M     18.0
5           Total__##!!??     43.0
0           Total__##!!??      NaN
6                       F     41.0
7                       M     26.0
8           Total__##!!??     67.0
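The output above keeps the original row labels, with 0 repeated for every header row. If a plain running index is preferred, dropping all index levels should be enough. The following is a sketch under a few assumptions: the frame is rebuilt from the question's values, add_header is a hypothetical helper name, and selecting the two columns on the groupby object is an adaptation that is not part of the answer as written:

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the values shown in the question.
df = pd.DataFrame({
    "ETHNIC": ["HISPANIC OR LATINO"] * 3
              + ["NOT HISPANIC OR LATINO"] * 3
              + ["Total__##!!??"] * 3,
    "SEX": ["F", "M", "Total__##!!??"] * 3,
    "USUBJID": [16, 8, 24, 25, 18, 43, 41, 26, 67],
})

def add_header(d):
    # Hypothetical helper: one header row carrying the group name,
    # followed by the group's rows with SEX renamed to stacked.
    header = pd.DataFrame([{"stacked": d.name, "USUBJID": np.nan}])
    body = d[["SEX", "USUBJID"]].rename(columns={"SEX": "stacked"})
    return pd.concat([header, body])

out = (
    df.groupby("ETHNIC")[["SEX", "USUBJID"]]
      .apply(add_header)
      .reset_index(drop=True)  # drop every index level, giving 0..n-1
)
print(out)
```

With drop=True and no level argument, both the group key and the original row labels are discarded, so the result is indexed 0 through 11.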
3 Comments

NB. depending on exactly which index you want you can play around with reset_index
I fixed a small error d.name instead of name
@ThePyGuy: it was ongoing, this is usually the part that takes most time ;)

Let us try with reshape

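from collections import defaultdict

def reshape():
    data = defaultdict(list)
    for k, g in df.groupby('ETHNIC'):
        # group name as header row, then the group's SEX values
        data['stacked'] += [k, *g['SEX']]
        # NaN for the header row, then the group's USUBJID values
        data['USUBJID'] += [np.nan, *g['USUBJID']]
    return data

pd.DataFrame(reshape())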

                   stacked  USUBJID
0       HISPANIC OR LATINO      NaN
1                        F     16.0
2                        M      8.0
3            Total__##!!??     24.0
4   NOT HISPANIC OR LATINO      NaN
5                        F     25.0
6                        M     18.0
7            Total__##!!??     43.0
8            Total__##!!??      NaN
9                        F     41.0
10                       M     26.0
11           Total__##!!??     67.0
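One thing to keep in mind: groupby sorts the group keys by default. In the question's data the ETHNIC values already appear in sorted order, so this makes no difference there. If the groups should instead keep their order of first appearance, passing sort=False may help. A small sketch on a hypothetical toy frame (the names toy and the frame/sort parameters are assumptions for illustration):

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical toy frame whose groups do NOT appear in alphabetical order.
toy = pd.DataFrame({
    "ETHNIC": ["B", "B", "A", "A"],
    "SEX": ["F", "M", "F", "M"],
    "USUBJID": [1, 2, 3, 4],
})

def reshape(frame, sort):
    data = defaultdict(list)
    # sort=False keeps groups in order of first appearance.
    for k, g in frame.groupby("ETHNIC", sort=sort):
        data["stacked"] += [k, *g["SEX"]]
        data["USUBJID"] += [np.nan, *g["USUBJID"]]
    return pd.DataFrame(data)

sorted_out = reshape(toy, sort=True)    # groups in key order: A, then B
ordered_out = reshape(toy, sort=False)  # groups as they appear: B, then A
print(ordered_out)
```
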

2 Comments

I guess defaultdict for the list of values in each group.
@ThePyGuy Yes, Exactly!

Primarily for fun, here is another option based on @Shubham Sharma's answer that doesn't require defaultdict. Even the dependency on numpy can be removed (see alternative at the end)

It only uses the pandas.DataFrame constructor and pandas.concat.

import numpy as np

pd.concat([pd.DataFrame({'stacked': np.append(k, g['SEX']),
                         'USUBJID': np.append(np.nan, g['USUBJID']),
                         })
           for k, g in df.groupby('ETHNIC')
           ])

output:

                  stacked  USUBJID
0      HISPANIC OR LATINO      NaN
1                       F     16.0
2                       M      8.0
3           Total__##!!??     24.0
0  NOT HISPANIC OR LATINO      NaN
1                       F     25.0
2                       M     18.0
3           Total__##!!??     43.0
0           Total__##!!??      NaN
1                       F     41.0
2                       M     26.0
3           Total__##!!??     67.0

alternative without numpy:

pd.concat([pd.DataFrame({'stacked': [k] + g['SEX'].to_list(),
                         'USUBJID': [None] + g['USUBJID'].to_list(),
                         })
           for k, g in df.groupby('ETHNIC')
           ])
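Both variants turn USUBJID into floats, because the header rows hold a missing value. If integer counts are preferred, one option is pandas' nullable Int64 dtype. The following is a sketch rather than part of the original answer: the frame is rebuilt from the question's values, and ignore_index=True and the astype call are additions:

```python
import pandas as pd

# Frame rebuilt from the values shown in the question.
df = pd.DataFrame({
    "ETHNIC": ["HISPANIC OR LATINO"] * 3
              + ["NOT HISPANIC OR LATINO"] * 3
              + ["Total__##!!??"] * 3,
    "SEX": ["F", "M", "Total__##!!??"] * 3,
    "USUBJID": [16, 8, 24, 25, 18, 43, 41, 26, 67],
})

out = pd.concat(
    [pd.DataFrame({"stacked": [k] + g["SEX"].to_list(),
                   "USUBJID": [None] + g["USUBJID"].to_list()})
     for k, g in df.groupby("ETHNIC")],
    ignore_index=True,               # running index 0..11 (an addition)
).astype({"USUBJID": "Int64"})       # nullable integers, missing shown as <NA>

print(out)
```

The header rows then show <NA> instead of NaN, and the counts print as 16, 8, 24 and so on.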

2 Comments

I think this approach is cleaner than your previous answer.
I agree, I just didn't think of it at that time. It is also faster ;)

You can use groupby().apply()

import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""
                   ETHNIC            SEX  USUBJID
0      HISPANIC OR LATINO              F       16
1      HISPANIC OR LATINO              M        8
2      HISPANIC OR LATINO  Total__##!!??       24
3  NOT HISPANIC OR LATINO              F       25
4  NOT HISPANIC OR LATINO              M       18
5  NOT HISPANIC OR LATINO  Total__##!!??       43
6           Total__##!!??              F       41
7           Total__##!!??              M       26
8           Total__##!!??  Total__##!!??       67"""), sep=r"\s\s+", engine="python")

df.groupby("ETHNIC", as_index=False).apply(
    lambda d: pd.concat(
        [
            # first row of the group, used as the header row
            d.iloc[0,].to_frame().T.assign(USUBJID=np.nan),
            # group rows with SEX moved into the ETHNIC column
            d.assign(ETHNIC=d.SEX),
        ]
    ).drop(columns="SEX")
).reset_index(drop=True)
ETHNIC USUBJID
0 HISPANIC OR LATINO nan
1 F 16
2 M 8
3 Total__##!!?? 24
4 NOT HISPANIC OR LATINO nan
5 F 25
6 M 18
7 Total__##!!?? 43
8 Total__##!!?? nan
9 F 41
10 M 26
11 Total__##!!?? 67

# use `total` as a counter
(df.assign(total=lambda df: pd.Series(np.where(df.SEX.str.startswith("Total"),
                                               df.index, np.nan)).bfill())
   .melt(['USUBJID', 'total'], ignore_index=False)
   .sort_index()
   .assign(temp=lambda df: df.variable.str.startswith("ETH").groupby(df.total).cumsum(),
           USUBJID=lambda df: np.where(df.variable.str.startswith("ETH"), np.nan, df.USUBJID))
   # keep only first row for `ETHNIC`
   .query("variable == 'ETHNIC' and temp == 1 or variable=='SEX' and temp >= 1")
   .drop(columns=['variable', 'total', 'temp'])
)

   USUBJID                   value
0      NaN      HISPANIC OR LATINO
0     16.0                       F
1      8.0                       M
2     24.0           Total__##!!??
3      NaN  NOT HISPANIC OR LATINO
3     25.0                       F
4     18.0                       M
5     43.0           Total__##!!??
6      NaN           Total__##!!??
6     41.0                       F
7     26.0                       M
8     67.0           Total__##!!??

Personally, I find the other answers simpler and easier to grok.
