1

I have a DataFrame which goes on like this :

+---------+-------------+--------------------+--------+ | ID | reg_num| reg_typ|reg_code| +---------+-------------+--------------------+--------+ |523528690| 134886307000|Chamber of Commer | 14246| |523528690|2015 / 369956|Government Gazett | 14225| |523528690| 997253630|Tax Registration | 14259| |523528691| 997253633|Tax Doc | 14250| |523528691| 997253634|Tax File | 14251| |523528691| 997253635|Tax Data | 14252| |523528691| 997253636|Tax Monitor | 14253| +---------+-------------+--------------------+--------+ 

Now I'm trying to achieve an output with the format as :

+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+ | ID | reg_num| reg_typ|reg_code| reg_1 | reg_2 | reg_3 | reg_4 | +---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+ |523528690| 134886307000|Chamber of Commer | 14246| 134886307000|2015 / 369956| 997253630 | null | |523528690|2015 / 369956|Government Gazett | 14225|134886307000 |2015 / 369956|997253630 |null | |523528690| 997253630|Tax Registration | 14259| 134886307000|2015 / 369956| 997253630 | null | |523528691| 997253633|Tax Doc | 14250| 997253633| 997253634| 997253635| 997253636| |523528691| 997253634|Tax File | 14251| 997253633| 997253634| 997253635| 997253636| |523528691| 997253635|Tax Data | 14252| 997253633| 997253634| 997253635| 997253636| |523528691| 997253636|Tax Monitor | 14253| 997253633| 997253634| 997253635| 997253636| +---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+ 

I have seen predefined functionalities like pivot, however it doesnt seem to fit in my case.

I'm using Spark version 1.6 and Scala version 2.10.5.

Help is appriciated!!

3
  • @eliasah The solution solves the problem and performs as needed. Thanks :) Commented May 31, 2017 at 9:09
  • Glad to hear that ! Commented May 31, 2017 at 9:09
  • @eliasah Just a question, when i am trying this over a large dataset, the arrangement of the reg_1,..reg_4 columns is not as per the sequence in the original dataframe, as in the 1st reg_num does not corresponds to reg_1. Is it because of the order by clause that is being used with the window function? Commented Jun 5, 2017 at 6:34

1 Answer 1

2

pivot is the way to go but the logic behind it is not obvious :

import org.apache.spark.sql.expressions.Window val df = Seq( (523528690, "134886307000", "Chamber of Commer", 14246), (523528690, "2015 / 369956", "Government Gazett", 14225), (523528690, "997253630", "Tax Registration", 14259), (523528691, "997253633", "Tax Doc", 14250), (523528691, "997253634", "Tax File", 14251), (523528691, "997253635", "Tax Data", 14252), (523528691, "997253636", "Tax Monitor", 14253)).toDF("id", "reg_num", "reg_type", "reg_code") val w = Window.partitionBy("id").orderBy("reg_num") df.show // +---------+-------------+-----------------+--------+ // | id| reg_num| reg_type|reg_code| // +---------+-------------+-----------------+--------+ // |523528690| 134886307000|Chamber of Commer| 14246| // |523528690|2015 / 369956|Government Gazett| 14225| // |523528690| 997253630| Tax Registration| 14259| // |523528691| 997253633| Tax Doc| 14250| // |523528691| 997253634| Tax File| 14251| // |523528691| 997253635| Tax Data| 14252| // |523528691| 997253636| Tax Monitor| 14253| // +---------+-------------+-----------------+--------+ val df2 = df.join(df.withColumn("rn", row_number.over(w)).groupBy("id").pivot("rn").agg(first("reg_num")), Seq("id")) df2.show // +---------+-------------+-----------------+--------+------------+-------------+---------+---------+ // | id| reg_num| reg_type|reg_code| 1| 2| 3| 4| // +---------+-------------+-----------------+--------+------------+-------------+---------+---------+ // |523528690| 134886307000|Chamber of Commer| 14246|134886307000|2015 / 369956|997253630| null| // |523528690|2015 / 369956|Government Gazett| 14225|134886307000|2015 / 369956|997253630| null| // |523528690| 997253630| Tax Registration| 14259|134886307000|2015 / 369956|997253630| null| // |523528691| 997253633| Tax Doc| 14250| 997253633| 997253634|997253635|997253636| // |523528691| 997253634| Tax File| 14251| 997253633| 997253634|997253635|997253636| // |523528691| 997253635| Tax Data| 14252| 997253633| 997253634|997253635|997253636| // |523528691| 997253636| Tax Monitor| 14253| 997253633| 997253634|997253635|997253636| // +---------+-------------+-----------------+--------+------------+-------------+---------+---------+ 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.