I have the following dataframe:

+-----------------+---+---+---+---+---+---+
|               ak|  1|  2|  3|  4|  5|  6|
+-----------------+---+---+---+---+---+---+
|8dce120638dbdf438|  2|  1|  0|  0|  0|  0|
|3fd28484316249e95|  1|  0|  3|  1|  4|  5|
|3636b43f64db33889|  9|  3|  3|  4| 18| 11|
+-----------------+---+---+---+---+---+---+

and I want to transpose it to the following:

ak                depth user_count
8dce120638dbdf438 1     2
8dce120638dbdf438 2     1
8dce120638dbdf438 3     0
8dce120638dbdf438 4     0
8dce120638dbdf438 5     0
8dce120638dbdf438 6     0
3fd28484316249e95 1     1
3fd28484316249e95 2     0
3fd28484316249e95 3     3
3fd28484316249e95 4     1
3fd28484316249e95 5     4
3fd28484316249e95 6     5
3636b43f64db33889 1     9
3636b43f64db33889 2     3
3636b43f64db33889 3     3
3636b43f64db33889 4     4
3636b43f64db33889 5     18
3636b43f64db33889 6     11

How to do it in Scala?

2 Answers

A similar approach to @Ramesh Maharjan's, but without the use of UDFs - instead, using Spark's built-in array and struct functions to construct a similar array that can be exploded:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// per column name, create a struct (similar to a tuple) of the column name and value:
def arrayItem(name: String) =
  struct(lit(name) cast IntegerType as "depth", $"$name" as "user_count")

// create an array of these per column, explode it and select the relevant columns:
df.withColumn("tmp", explode(array(df.columns.tail.map(arrayItem): _*)))
  .select($"ak", $"tmp.depth", $"tmp.user_count")
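Conceptually, the struct-and-explode pipeline above performs a wide-to-long "melt". A minimal plain-Scala sketch of the same transformation on the sample data (no Spark required; the names `wide` and `long` are illustrative):

```scala
// Wide rows from the question: key -> counts for depths 1..6
val wide: Seq[(String, Seq[Int])] = Seq(
  "8dce120638dbdf438" -> Seq(2, 1, 0, 0, 0, 0),
  "3fd28484316249e95" -> Seq(1, 0, 3, 1, 4, 5),
  "3636b43f64db33889" -> Seq(9, 3, 3, 4, 18, 11)
)

// For each row, pair every value with its column name (the depth) and
// flatten: this mirrors what array(...) + explode does in Spark.
val long: Seq[(String, Int, Int)] = wide.flatMap { case (ak, counts) =>
  counts.zipWithIndex.map { case (userCount, i) => (ak, i + 1, userCount) }
}

long.take(3).foreach(println)
// (8dce120638dbdf438,1,2)
// (8dce120638dbdf438,2,1)
// (8dce120638dbdf438,3,0)
```

Each input row fans out into six output rows, so three wide rows become the eighteen long rows shown in the question.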

The solution is straightforward: collect the values together with their column names into an array, use the explode function to turn each array element into a separate row, and finally split the key and value into separate columns.

Summarizing the above explanation into code, with inline comments:

import org.apache.spark.sql.functions._

val columns = df.columns.tail // selecting columns to be changed to rows

// udf that zips the column names with the column values,
// returning an array of (name, value) pairs
def zipUdf = udf((cols: collection.mutable.WrappedArray[String],
                  vals: collection.mutable.WrappedArray[String]) => cols.zip(vals))

df.select(col("ak"), zipUdf(lit(columns), array(columns.map(col): _*)).as("depth")) // calling the udf defined above
  .withColumn("depth", explode(col("depth"))) // exploding the array column into separate rows
  .select(col("ak"), col("depth._1").as("depth"), col("depth._2").as("user_count")) // selecting columns as required in the output
  .show(false)

You should get the following output:

+-----------------+-----+----------+
|ak               |depth|user_count|
+-----------------+-----+----------+
|8dce120638dbdf438|1    |2         |
|8dce120638dbdf438|2    |1         |
|8dce120638dbdf438|3    |0         |
|8dce120638dbdf438|4    |0         |
|8dce120638dbdf438|5    |0         |
|8dce120638dbdf438|6    |0         |
|3fd28484316249e95|1    |1         |
|3fd28484316249e95|2    |0         |
|3fd28484316249e95|3    |3         |
|3fd28484316249e95|4    |1         |
|3fd28484316249e95|5    |4         |
|3fd28484316249e95|6    |5         |
|3636b43f64db33889|1    |9         |
|3636b43f64db33889|2    |3         |
|3636b43f64db33889|3    |3         |
|3636b43f64db33889|4    |4         |
|3636b43f64db33889|5    |18        |
|3636b43f64db33889|6    |11        |
+-----------------+-----+----------+
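The core of the udf is just Scala's `zip` on the two arrays; a standalone sketch of that pairing step (plain Scala, using the values of one sample row):

```scala
// What zipUdf computes for a single row: pair each column name
// with that row's value for it (values shown for 3636b43f64db33889)
val cols = Seq("1", "2", "3", "4", "5", "6")
val vals = Seq("9", "3", "3", "4", "18", "11")

val zipped: Seq[(String, String)] = cols.zip(vals)
// explode then turns each pair into its own row;
// depth._1 / depth._2 select the pair's two fields
println(zipped)
// List((1,9), (2,3), (3,3), (4,4), (5,18), (6,11))
```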
