
I am using Scala and Spark to unpivot a table that looks like this:

+---+---------+------+-------+----+----+
| ID|     Date| Type1|  Type2|0:30|1:00|
+---+---------+------+-------+----+----+
|  G|12/3/2018|Import|Voltage| 3.5| 6.8|
|  H|13/3/2018|Import|Voltage| 7.5| 9.8|
|  H|13/3/2018|Export|   Watt| 4.5| 8.9|
|  H|13/3/2018|Export|Voltage| 5.6| 9.1|
+---+---------+------+-------+----+----+

I want to transpose it as follows:

+---+---------+----+--------------+--------------+-----------+-----------+
| ID|     Date|Time|Import-Voltage|Export-Voltage|Import-Watt|Export-Watt|
+---+---------+----+--------------+--------------+-----------+-----------+
|  G|12/3/2018|0:30|           3.5|             0|          0|          0|
|  G|12/3/2018|1:00|           6.8|             0|          0|          0|
|  H|13/3/2018|0:30|           7.5|           5.6|          0|        4.5|
|  H|13/3/2018|1:00|           9.8|           9.1|          0|        8.9|
+---+---------+----+--------------+--------------+-----------+-----------+

The Date and Time columns should also be merged, like:

12/3/2018 0:30 

1 Answer


Not a straightforward task, but one approach would be to:

  1. group each time label and its corresponding value into a "map" (an array of time-value structs)
  2. flatten it out into a column of time-value pairs with explode
  3. perform a groupBy-pivot-agg transformation, using the time as part of the groupBy key and the types as the pivot column, to aggregate the corresponding time values

Sample code below:

import org.apache.spark.sql.functions._

val df = Seq(
  ("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
  ("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
  ("H", "13/3/2018", "Export", "Watt", 4.5, 8.9),
  ("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")

df.
  // pair each time label with its value in an array of (time, value) structs
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  // flatten the array into one row per (time, value) pair
  withColumn("TimeVal", explode($"TimeValMap")).
  withColumn("Time", $"TimeVal._1").
  // combine Type1 and Type2 into the pivot key, e.g. "Import-Voltage"
  withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
  groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
  orderBy("ID", "Date", "Time").
  na.fill(0.0).
  show
// +---+---------+----+--------------+-----------+--------------+
// | ID|     Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// |  G|12/3/2018|0:30|           0.0|        0.0|           3.5|
// |  G|12/3/2018|1:00|           0.0|        0.0|           6.8|
// |  H|13/3/2018|0:30|           5.6|        4.5|           7.5|
// |  H|13/3/2018|1:00|           9.1|        8.9|           9.8|
// +---+---------+----+--------------+-----------+--------------+
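The question also asks for Date and Time to be merged into a single column (e.g. 12/3/2018 0:30), which the above doesn't cover. A minimal sketch, assuming the pivoted result is held in a val named pivoted (my name, not from the answer):

// `pivoted` is a hypothetical name for the groupBy-pivot result above
val withDateTime = pivoted.
  withColumn("DateTime", concat_ws(" ", $"Date", $"Time")).
  drop("Date", "Time")
// DateTime now holds values such as "12/3/2018 0:30"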

8 Comments

Impressive, Leo. Why the use of lit? And would stack not be an option? I thought indeed that the title's unpivot may not be quite correct. @Leo C
@thebluephantom, lit("0:30"), for example, is for capturing the literal time label "0:30" which (besides serving as a "key") will be in column Time. I agree unpivot may not best describe the requirement, more like a custom transpose.
How can I see the full value of TimeValMap?
In fact I cannot see how withColumn("TimeValMap", array(struct(lit("0:30").as("_1"), $"0:30".as("_2")), struct(lit("1:00").as("_1"), $"1:00".as("_2")))) actually works; interesting.
To view the value, simply execute through the code line that completes the first withColumn(), i.e. df.withColumn("TimeValMap", ...).show(false). For each of the two time labels, method struct creates a time-value tuple, and method array groups the two tuples into an array-type column. This pairs up each time label with its corresponding value so that the association is not lost through the subsequent transformations, including the groupBy-pivot.
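To make that concrete, inspecting the intermediate column on the sample data would look roughly like the following (the exact rendering of structs varies across Spark versions):

df.
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  select($"ID", $"TimeValMap").
  show(false)
// e.g. for the first row: [[0:30, 3.5], [1:00, 6.8]]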
