
How can I transpose rows to columns using an RDD or DataFrame, without pivot?

SessionId,date,orig,dest,legind,nbr
1,9/20/16,abc0,xyz0,o,1
1,9/20/16,abc1,xyz1,o,2
1,9/20/16,abc2,xyz2,i,3
1,9/20/16,abc3,xyz3,i,4

So I want to generate a new schema like:

SessionId,date,orig1,orig2,orig3,orig4,dest1,dest2,dest3,dest4
1,9/20/16,abc0,abc1,null,null,xyz0,xyz1,null,null

The logic is:

  • if nbr is 1 and legind = o, then the value goes to orig1 (fetched from row 1) ...

  • if nbr is 3 and legind = i, then the value goes to dest1 (fetched from row 3)
So how can I transpose the rows to columns?

Any ideas would be greatly appreciated.

I tried the option below, but it just flattens everything into a single row:

val keys = List("SessionId")

val selectFirstValueOfNoneGroupedColumns = df.columns
  .filterNot(keys.toSet)
  .map(_ -> "first").toMap

val grouped = df.groupBy(keys.head, keys.tail: _*)
  .agg(selectFirstValueOfNoneGroupedColumns)
  .show()
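The collapse happens because first is applied to each non-grouped column independently, so every group keeps only its first leg. A plain-Scala analogy of that behavior (hypothetical in-memory tuples, not the Spark API):

```scala
object FirstPerColumn extends App {
  // Hypothetical rows for one SessionId group: (orig, dest) per leg.
  val legs = Seq(("abc0", "xyz0"), ("abc1", "xyz1"), ("abc2", "xyz2"))

  // "first" taken per column, as the groupBy/agg above effectively does:
  val firstOfEach = (legs.map(_._1).head, legs.map(_._2).head)

  // Only the first leg survives; abc1/abc2 and xyz1/xyz2 are dropped.
  println(firstOfEach)
}
```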
  • Welcome to SO. Please spend some time on the help pages to see how to ask a question clearly and format it neatly. Commented Oct 1, 2016 at 17:59
  • Yes, it is straightforward and it already has an answer! Follow the link I gave. Commented Oct 1, 2016 at 18:28
  • @TheArchetypalPaul: the link you provided is a different question and approach. Commented Oct 1, 2016 at 18:34
  • Ah, nothing to do with transpose then. It's not at all clear what your desired output is - please provide more data. It's particularly odd that you have abc0 and abc2 in the output, but xyz0, xyz1. It may be obvious to you what you're doing here, but it certainly isn't obvious to a reader without knowledge of the context. Commented Oct 1, 2016 at 18:38
  • My mistake... thanks for pointing that out. The output should be: 1,9/20/16,abc0,abc1,null,null,xyz0,xyz1,null,null Commented Oct 1, 2016 at 18:42

1 Answer

It is relatively simple if you use the pivot function. First, let's create a dataset like the one in your question:

import org.apache.spark.sql.functions.{concat, first, lit, when}

val df = Seq(
  ("1", "9/20/16", "abc0", "xyz0", "o", "1"),
  ("1", "9/20/16", "abc1", "xyz1", "o", "2"),
  ("1", "9/20/16", "abc2", "xyz2", "i", "3"),
  ("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")

then define and attach helper columns:

// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
  .when($"legind" === "i", concat(lit("dest"), $"nbr"))

// This will be the value
val value = when($"legind" === "o", $"orig") // If o take orig
  .when($"legind" === "i", $"dest")          // If i take dest

val withKV = df.withColumn("key", key).withColumn("value", value)

This will result in a DataFrame like this:

+---------+-------+----+----+------+---+-----+-----+
|SessionId|   date|orig|dest|legind|nbr|  key|value|
+---------+-------+----+----+------+---+-----+-----+
|        1|9/20/16|abc0|xyz0|     o|  1|orig1| abc0|
|        1|9/20/16|abc1|xyz1|     o|  2|orig2| abc1|
|        1|9/20/16|abc2|xyz2|     i|  3|dest3| xyz2|
|        1|9/20/16|abc3|xyz3|     i|  4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+
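The when/concat branching that builds key and value can be sketched in plain Scala (the Leg case class and kv helper are hypothetical names, not part of the answer): outbound legs become origN columns, inbound legs become destN columns.

```scala
object KeyValueSketch extends App {
  // Hypothetical row type mirroring the DataFrame columns.
  case class Leg(sessionId: String, date: String, orig: String,
                 dest: String, legind: String, nbr: String)

  // Same branching as the when(...) expressions above.
  def kv(l: Leg): (String, String) = l.legind match {
    case "o" => (s"orig${l.nbr}", l.orig) // outbound: origN -> orig value
    case "i" => (s"dest${l.nbr}", l.dest) // inbound:  destN -> dest value
  }

  println(kv(Leg("1", "9/20/16", "abc0", "xyz0", "o", "1"))) // (orig1,abc0)
  println(kv(Leg("1", "9/20/16", "abc2", "xyz2", "i", "3"))) // (dest3,xyz2)
}
```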

Next let's define a list of possible levels:

val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y")) 

and finally pivot:

val result = withKV
  .groupBy($"sessionId", $"date")
  .pivot("key", levels)
  .agg(first($"value", true))

result.show

And the result is:

+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId|   date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|        1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
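Since the question asked for a way without pivot: the same result can be built by grouping the (key, value) pairs and spreading them across the levels by hand. Sketched here with plain Scala collections (hypothetical in-memory rows, not the answer's code); on an RDD the shape is the same - map to ((sessionId, date), (key, value)) pairs, groupByKey, then build one output row per group.

```scala
object NoPivotSketch extends App {
  // Hypothetical in-memory version of the withKV rows:
  // (sessionId, date, key, value).
  val kvRows = Seq(
    ("1", "9/20/16", "orig1", "abc0"),
    ("1", "9/20/16", "orig2", "abc1"),
    ("1", "9/20/16", "dest3", "xyz2"),
    ("1", "9/20/16", "dest4", "xyz3")
  )
  val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))

  // Group per (sessionId, date) and look each level up in the group's map;
  // missing levels become null, exactly like the pivot output.
  val transposed = kvRows
    .groupBy { case (sid, date, _, _) => (sid, date) }
    .map { case ((sid, date), group) =>
      val m = group.map { case (_, _, k, v) => k -> v }.toMap
      (sid, date, levels.map(m.get(_).orNull))
    }
    .toSeq

  transposed.foreach(println)
}
```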
