
I am using Scala and Spark to unpivot a table that looks like this:

+---+---------+------+-------+----+----+
| ID|     Date| Type1|  Type2|0:30|1:00|
+---+---------+------+-------+----+----+
|  G|12/3/2018|Import|Voltage| 3.5| 6.8|
|  H|13/3/2018|Import|Voltage| 7.5| 9.8|
|  H|13/3/2018|Export|   Watt| 4.5| 8.9|
|  H|13/3/2018|Export|Voltage| 5.6| 9.1|
+---+---------+------+-------+----+----+

I want to transpose it as follows:

+---+---------+----+--------------+--------------+-----------+-----------+
| ID|     Date|Time|Import-Voltage|Export-Voltage|Import-Watt|Export-Watt|
+---+---------+----+--------------+--------------+-----------+-----------+
|  G|12/3/2018|0:30|           3.5|             0|          0|          0|
|  G|12/3/2018|1:00|           6.8|             0|          0|          0|
|  H|13/3/2018|0:30|           7.5|           5.6|          0|        4.5|
|  H|13/3/2018|1:00|           9.8|           9.1|          0|        8.9|
+---+---------+----+--------------+--------------+-----------+-----------+

The Date and Time columns should also be merged, like:

12/3/2018 0:30 

1 Answer


Not a straightforward task, but one approach would be to:

  1. group each time label and its corresponding value into a "map" (an array of time-value structs)
  2. flatten it out into a column of time-value pairs with explode
  3. perform a groupBy-pivot-agg transformation, using the time as part of the groupBy key and the types as the pivot column, to aggregate the corresponding time values

Sample code below:

import org.apache.spark.sql.functions._

val df = Seq(
  ("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
  ("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
  ("H", "13/3/2018", "Export", "Watt", 4.5, 8.9),
  ("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")

df.
  // pair each time label with its value in an array of (time, value) structs
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  // flatten the array into one row per (time, value) pair
  withColumn("TimeVal", explode($"TimeValMap")).
  withColumn("Time", $"TimeVal._1").
  // combine Type1 and Type2 into the pivot key, e.g. "Import-Voltage"
  withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
  groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
  orderBy("ID", "Date", "Time").
  na.fill(0.0).
  show
// +---+---------+----+--------------+-----------+--------------+
// | ID|     Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// |  G|12/3/2018|0:30|           0.0|        0.0|           3.5|
// |  G|12/3/2018|1:00|           0.0|        0.0|           6.8|
// |  H|13/3/2018|0:30|           5.6|        4.5|           7.5|
// |  H|13/3/2018|1:00|           9.1|        8.9|           9.8|
// +---+---------+----+--------------+-----------+--------------+
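The question also asks for Date and Time to be merged into a single column (e.g. 12/3/2018 0:30), which the above doesn't cover. A minimal sketch, assuming the pivoted result is held in a val named pivoted (my name, not from the answer):

// `pivoted` is a hypothetical name for the groupBy-pivot result above
val withDateTime = pivoted.
  withColumn("DateTime", concat_ws(" ", $"Date", $"Time")).
  drop("Date", "Time")
// DateTime now holds values such as "12/3/2018 0:30"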

8 Comments

Impressive, Leo. Why the use of lit? And would stack not be an option? I thought indeed that the title's unpivot may not be quite correct. @Leo C
@thebluephantom, lit("0:30"), for example, is for capturing the literal time label "0:30" which (besides serving as a "key") will be in column Time. I agree unpivot may not best describe the requirement, more like a custom transpose.
How can I see the full value of TimeValMap?
In fact I cannot see how withColumn("TimeValMap", array(struct(lit("0:30").as("_1"), $"0:30".as("_2")), struct(lit("1:00").as("_1"), $"1:00".as("_2")))) actually works; interesting.
To view the value, simply execute through the code line that completes the first withColumn(), i.e. df.withColumn("TimeValMap", ...).show(false). For each of the two time labels, method struct creates a time-value tuple, and method array groups the two tuples into an array-type column. This pairs up each time label with its corresponding value so that the association is not lost through the subsequent transformations, including the groupBy-pivot.
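To make that concrete, inspecting the intermediate column on the sample data would look roughly like the following (the exact rendering of structs varies across Spark versions):

df.
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  select($"ID", $"TimeValMap").
  show(false)
// e.g. for the first row: [[0:30, 3.5], [1:00, 6.8]]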
