
I have a DataFrame which I want to modify so that each value is prefixed with its column name. For example:

FirstName LastName
Jhon      Doe
David     Lue

should become the following:

(FirstName=Jhon,LastName=Doe) (FirstName=David,LastName=Lue) 

I managed to do it for a df with 2 columns:

val x = df.map { row => (names(0) + "=" + row(0), names(1) + "=" + row(1)) }

but how can I do it with a loop for any number of columns?

Thanks

1 Answer


One option is to use foldLeft on the column names:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import sqlContext.implicits._

val df = Seq(
  ("John", "Doe"),
  ("David", "Lue")
).toDF("first_name", "last_name")

val x = df.columns.foldLeft(df) { (acc: DataFrame, colName: String) =>
  acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}

x.show()

Resulting in:

+----------------+-------------+
|      first_name|    last_name|
+----------------+-------------+
| first_name=John|last_name=Doe|
|first_name=David|last_name=Lue|
+----------------+-------------+
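The key idea is that foldLeft threads an accumulator (here the DataFrame) through the column names, applying one withColumn per step. The same accumulation pattern in plain Scala, with a Map standing in for the DataFrame so the sketch runs without Spark (the names are just illustrative):

```scala
// foldLeft threads an accumulator through a collection: here a Map
// plays the role of the DataFrame, and each step "replaces a column"
// by prefixing its value with the column name.
val row = Map("first_name" -> "John", "last_name" -> "Doe")

val prefixed = row.keys.foldLeft(row) { (acc, colName) =>
  acc.updated(colName, colName + "=" + acc(colName))
}
// prefixed("first_name") == "first_name=John"
// prefixed("last_name")  == "last_name=Doe"
```

Each iteration returns a new accumulator, exactly as each withColumn call returns a new DataFrame.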

If you then want to convert it to an RDD of tuples, you can call a map on it:

x.rdd.map(r => (r.getString(0), r.getString(1))) 

or even with Spark SQL's typed API:

x.as[(String, String)].rdd 
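And if the target is the exact `(FirstName=Jhon,LastName=Doe)` strings from the question, the per-row formatting generalizes to any number of columns by zipping the column names with the row values. A plain-Scala sketch of that step (Spark omitted so it runs standalone; inside Spark the same expression would go in a `map` over the rows, with `row.toSeq` supplying the values):

```scala
// Zip column names with the values of one row, render each pair
// as "name=value", and join them inside parentheses.
val names  = Seq("FirstName", "LastName")
val values = Seq("Jhon", "Doe")

val rendered = names.zip(values)
  .map { case (n, v) => s"$n=$v" }
  .mkString("(", ",", ")")
// rendered == "(FirstName=Jhon,LastName=Doe)"
```

Because zip pairs positionally, this works unchanged for 2 columns or 20.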

1 Comment

Thanks a lot! It worked like a charm. Since I'm a new user, my marking it as the accepted answer is counted but not displayed. Thanks again!
