How can I use pyspark or pandas to achieve the transformation below? Thanks a lot. The source file is a CSV with the following info:
expected:
I assume that your source CSV file is as follows:
```
 ,Group_1,Group_2
 ,Sub_Group_1,Sub_Group_1
Maturity,7/31/20,7/31/20
0.1,0.2,0.3
0.2,0.3,0.4
```

Note that the first two header rows contain a single space as the name of the first column (this assumption matters for the melt call below).
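If you want to reproduce the steps below, you can create such a file yourself. A minimal sketch (the filename Input.csv is just the name used in the read_csv call below):

```python
# Write the assumed sample CSV to disk so the steps below are reproducible.
# Note the single space before the first comma in the two top header rows.
sample = (
    " ,Group_1,Group_2\n"
    " ,Sub_Group_1,Sub_Group_1\n"
    "Maturity,7/31/20,7/31/20\n"
    "0.1,0.2,0.3\n"
    "0.2,0.3,0.4\n"
)
with open('Input.csv', 'w') as f:
    f.write(sample)
```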
I read it by calling:

```python
import pandas as pd

df = pd.read_csv('Input.csv', header=[0, 1, 2])
```

and the content is:
```
               Group_1     Group_2
           Sub_Group_1 Sub_Group_1
  Maturity     7/31/20     7/31/20
0      0.1         0.2         0.3
1      0.2         0.3         0.4
```

(You should have included this text in your question instead of a picture.)
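If you want to verify the assumption about the space-named levels, you can inspect the parsed column tuples directly; under my assumption the output looks like this:

```python
# The columns form a 3-level MultiIndex; the first column's name is
# (' ', ' ', 'Maturity') if the header cells really contain spaces.
print(df.columns.tolist())
# [(' ', ' ', 'Maturity'),
#  ('Group_1', 'Sub_Group_1', '7/31/20'),
#  ('Group_2', 'Sub_Group_1', '7/31/20')]
```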
The first step is melt:
```python
result = pd.melt(df, id_vars=[(' ', ' ', 'Maturity')],
                 var_name=['Group_Name', 'Sub_Group_Name', 'Date'])
```

Note that in the code above I specified id_vars as a 3-level tuple whose two top elements are spaces (this is why I made the assumption about the initial spaces in the header rows).
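If you would rather not hard-code the spaces (for example, if your real file has empty header cells, which pandas fills with "Unnamed: ..." placeholder names), you can select the id column by position instead. A sketch of this variant:

```python
# Same melt, but the id column is taken positionally, so it works
# whatever the three header levels of the first column actually are.
result = pd.melt(df, id_vars=[df.columns[0]],
                 var_name=['Group_Name', 'Sub_Group_Name', 'Date'])
```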
The result is:
```
  ( , , Maturity) Group_Name Sub_Group_Name     Date value
0             0.1    Group_1    Sub_Group_1  7/31/20   0.2
1             0.2    Group_1    Sub_Group_1  7/31/20   0.3
2             0.1    Group_2    Sub_Group_1  7/31/20   0.3
3             0.2    Group_2    Sub_Group_1  7/31/20   0.4
```

And the only thing left to do is to rename the first column:
```python
result.rename(columns={result.columns[0]: 'Maturity'}, inplace=True)
```

Now, when you print the result, you will get:
```
   Maturity Group_Name Sub_Group_Name     Date value
0       0.1    Group_1    Sub_Group_1  7/31/20   0.2
1       0.2    Group_1    Sub_Group_1  7/31/20   0.3
2       0.1    Group_2    Sub_Group_1  7/31/20   0.3
3       0.2    Group_2    Sub_Group_1  7/31/20   0.4
```

The row ordering is a bit different from your expected output, but this is the way melt works. If you are unhappy about this detail, reorder the rows according to your needs.
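Since you also mentioned pyspark: Spark's CSV reader only supports a single header row, so the simplest route is to do the reshaping in pandas as above and then convert the result. A minimal sketch, assuming you want a Spark DataFrame at the end (the sort step is optional and just illustrates one possible reordering):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Optional: reorder the melted rows, e.g. by Maturity and then Group_Name.
result = result.sort_values(['Maturity', 'Group_Name']).reset_index(drop=True)

# Hand the reshaped pandas DataFrame over to Spark.
sdf = spark.createDataFrame(result)
sdf.show()
```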