
I saw this question here: Transpose DataFrame Without Aggregation in Spark with scala and I wanted to do exactly the opposite.

I have this DataFrame with a single row, whose values are of string, int, bool, and array types:

+-----+-----+-----+-----+-----+
|col1 |col2 |col3 |col4 |col5 |
+-----+-----+-----+-----+-----+
|val1 |val2 |val3 |val4 |val5 |
+-----+-----+-----+-----+-----+

And I want to transpose it like this:

+-------+------+
|Columns|values|
+-------+------+
|col1   |val1  |
|col2   |val2  |
|col3   |val3  |
|col4   |val4  |
|col5   |val5  |
+-------+------+

I am using Apache Spark 2.4.3 with Scala 2.11

Edit: Values can be of any type (int, double, bool, array), not only strings.

2 Answers


I approached it differently, without using arrays_zip (which is only available from Spark 2.4), and got the below...

It works from Spark 2.0 onwards in a simpler way, using the flatMap, map and explode functions.

Here the map function (used inside withColumn) creates a new map column; its input columns must be grouped as key-value pairs.
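As an illustration of that key-value grouping, a minimal sketch (the two column names are just placeholders):

import org.apache.spark.sql.functions.{col, lit, map}

// flatMap interleaves a literal column name with the column itself
val kv = Seq("col1", "col2").flatMap(c => Seq(lit(c), col(c)))
// kv is Seq(lit("col1"), col("col1"), lit("col2"), col("col2"))
// map(kv: _*) then builds a column like {"col1" -> <value of col1>, "col2" -> <value of col2>}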

Case: string data types in the data:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

val df: DataFrame = Seq(("val1", "val2", "val3", "val4", "val5")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()

// interleave each column name (as a literal) with the column itself
val columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }

df.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)

Result :

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: string (nullable = true)
 |-- col5: string (nullable = true)

+-------+------+
|Columns|Values|
+-------+------+
|col1   |val1  |
|col2   |val2  |
|col3   |val3  |
|col4   |val4  |
|col5   |val5  |
+-------+------+

Case: mixed data types in the data:

If you have different types, convert them all to String first; the remaining steps won't change.

import org.apache.spark.sql.types.StringType

val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)

Full Example :

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()

// convert all columns to string type, since map() needs values of a single type
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()

val columnsAndValues: Array[Column] = df.columns.flatMap { c => Array(lit(c), col(c)) }

df1.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)

Result :

root
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: boolean (nullable = false)
 |-- col4: double (nullable = false)
 |-- col5: string (nullable = true)

root
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = false)
 |-- col3: string (nullable = false)
 |-- col4: string (nullable = false)
 |-- col5: string (nullable = true)

+-------+------+
|Columns|Values|
+-------+------+
|col1   |2     |
|col2   |3     |
|col3   |true  |
|col4   |2.4   |
|col5   |val   |
+-------+------+
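If the transposed pairs are also needed on the driver, here is a small follow-up sketch building on the full example above (transposed and asMap are just illustrative names, not part of the original answer):

val transposed = df1.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")

// collect the two string columns into a plain Scala Map on the driver
val asMap: Map[String, String] =
  transposed.collect().map(r => r.getString(0) -> r.getString(1)).toMap
// asMap: Map(col1 -> 2, col2 -> 3, col3 -> true, col4 -> 2.4, col5 -> val)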

2 Comments

Hmmm, I forgot to specify the types of the values... Actually they are mixed (not only string; they can be int, float, bool, etc.), and I got this exception: org.apache.spark.sql.AnalysisException: cannot resolve 'map[...]' due to data type mismatch: The given values of function map should all be the same type
Do you know how this can be done in Python, i.e. PySpark?

From Spark 2.4, use arrays_zip with array(column_values) and array(column_names), then explode to get the result.

Example:

val df=Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5") val cols=df.columns.map(x => col(s"${x}")) val str_cols=df.columns.mkString(",") df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))). select("new.*"). toDF("values","Columns"). show() //+------+-------+ //|values|Columns| //+------+-------+ //| val1| col1| //| val2| col2| //| val3| col3| //| val4| col4| //| val5| col5| //+------+-------+ 
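To see what arrays_zip builds before the explode, a quick inspection of the intermediate column (just illustrative; it reuses cols and str_cols from the example above):

df.select(arrays_zip(array(cols: _*), split(lit(str_cols), ",")).as("zipped")).show(false)
// roughly: [[val1, col1], [val2, col2], [val3, col3], [val4, col4], [val5, col5]]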

UPDATE:

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)

// cast every column to string
val cols = df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
val str_cols = df.columns.mkString(",")

df.withColumn("new", explode(arrays_zip(array(cols: _*), split(lit(str_cols), ","))))
  .select("new.*")
  .toDF("values", "Columns")
  .show()

//+------+-------+
//|values|Columns|
//+------+-------+
//|     2|   col1|
//|     3|   col2|
//|  true|   col3|
//|   2.4|   col4|
//|   val|   col5|
//+------+-------+
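For reuse, the same approach could be wrapped in a small helper; this is only a sketch, and the name transposeSingleRow is illustrative rather than anything from the answer:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// sketch: transpose a single-row DataFrame into (Values, Columns) rows,
// casting everything to string so array() accepts mixed input types
def transposeSingleRow(df: DataFrame): DataFrame = {
  val values: Array[Column] = df.columns.map(c => col(c).cast("string"))
  val names: Column = split(lit(df.columns.mkString(",")), ",")
  df.withColumn("new", explode(arrays_zip(array(values: _*), names)))
    .select("new.*")
    .toDF("Values", "Columns")
}

// usage: transposeSingleRow(df).show(false)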

7 Comments

Nice, but what if the OP is using a lower version of Spark?
@RamGhadiyaram, thanks! Then we would need to use a udf in Scala, taking this post as a reference: stackoverflow.com/a/41027619
Check my answer without udf.
@Shu, you can use the SparkSQL function stack, check: stackoverflow.com/questions/42465568/… (see the sketch after these comments).
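Following the stack suggestion in the comments above, a minimal sketch for Spark versions below 2.4 (this is an assumption based on that comment, not part of either answer; everything is cast to string first so the stacked value column has a single type):

import org.apache.spark.sql.functions._

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")

// cast all columns to string so stack's output value column has a single type
val strDf = df.select(df.columns.map(c => col(c).cast("string").alias(c)): _*)

// build: stack(5, 'col1', col1, 'col2', col2, ...) -> one output row per (Columns, Values) pair
val stackExpr = df.columns.map(c => s"'$c', `$c`").mkString(", ")

strDf.selectExpr(s"stack(${df.columns.length}, $stackExpr) as (Columns, Values)").show(false)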
