How to merge two rows in Spark SQL?

Question

I need to merge rows in the same DataFrame based on a key column "id". In the sample DataFrame, one row has data for id, name and age. The other row has id, name and city. Rows with the same key 'id' have to be merged into a single record in the final DataFrame. If there is just one record for an id, it should still appear, with null for the missing value (Smith and Jake in the example below).

The computation needs to happen on real-time data, so a solution based on native Spark functions would be ideal. I have tried filtering the records on the age and city columns into separate DataFrames and then performing a left join on id, but it's not very efficient. I'm looking for alternative suggestions. Thanks in advance!

Sample Dataframe

val inputDF = Seq(
  ("100", "John", Some(35), None),
  ("100", "John", None, Some("Georgia")),
  ("101", "Mike", Some(25), None),
  ("101", "Mike", None, Some("New York")),
  ("103", "Mary", Some(22), None),
  ("103", "Mary", None, Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake", None, Some("Florida"))
).toDF("id", "name", "age", "city")

Input Dataframe

+---+-----+----+--------+
|id |name |age |city    |
+---+-----+----+--------+
|100|John |35  |null    |
|100|John |null|Georgia |
|101|Mike |25  |null    |
|101|Mike |null|New York|
|103|Mary |22  |null    |
|103|Mary |null|Texas   |
|104|Smith|25  |null    |
|105|Jake |null|Florida |
+---+-----+----+--------+

Expected Output Dataframe

+---+-----+----+---------+
| id| name| age|     city|
+---+-----+----+---------+
|100| John|  35|  Georgia|
|101| Mike|  25| New York|
|103| Mary|  22|    Texas|
|104|Smith|  25|     null|
|105| Jake|null|  Florida|
+---+-----+----+---------+

Jacek Laskowski · Accepted Answer · 2020-09-16 20:08:47Z

Use the first or last standard functions with the ignoreNulls flag on.
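To make the effect of the flag concrete, here is a minimal sketch in plain Scala collections (no Spark) of what "first non-null value per group" means. The `Person` case class, the `MergeSketch` object and the `mergeRows` helper are hypothetical names used only for illustration; this models the semantics and is not how Spark implements the aggregation.

```scala
// Plain Scala model of first(..., ignoreNulls = true) per group.
// Person, MergeSketch and mergeRows are hypothetical names for this sketch.
final case class Person(id: String, name: String, age: Option[Int], city: Option[String])

object MergeSketch {
  def mergeRows(rows: Seq[Person]): Seq[Person] =
    rows
      .groupBy(r => (r.id, r.name))
      .map { case ((id, name), group) =>
        // flatMap drops the None entries; headOption keeps the first remaining value
        Person(id, name, group.flatMap(_.age).headOption, group.flatMap(_.city).headOption)
      }
      .toSeq
      .sortBy(_.id)

  // A subset of the sample data from the question
  val rows: Seq[Person] = Seq(
    Person("100", "John", Some(35), None),
    Person("100", "John", None, Some("Georgia")),
    Person("104", "Smith", Some(25), None),
    Person("105", "Jake", None, Some("Florida"))
  )

  val merged: Seq[Person] = mergeRows(rows)
}
```

An id with a single row (104, 105) comes through unchanged, with the missing value still empty, which matches the expected output in the question.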

first standard function

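import org.apache.spark.sql.functions.first

// Per (id, name) group, keep the first non-null age and city
val q = inputDF
  .groupBy("id", "name")
  .agg(
    first("age", ignoreNulls = true) as "age",
    first("city", ignoreNulls = true) as "city")
  .orderBy("id")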

last standard function

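import org.apache.spark.sql.functions.last

// Per (id, name) group, keep the last non-null age and city.
// ignoreNulls is set on both columns so a trailing null row cannot win.
val q = inputDF
  .groupBy("id", "name")
  .agg(
    last("age", ignoreNulls = true) as "age",
    last("city", ignoreNulls = true) as "city")
  .orderBy("id")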
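Since the title asks about Spark SQL, the same aggregation can also be written as a SQL query, because the SQL `first` function accepts an optional second argument that ignores nulls. The following is a sketch rather than a definitive implementation: the object name `MergeRowsSql`, the temp view name `people`, and the local master setting are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object; the view name "people" is also an assumption.
object MergeRowsSql {
  def merge(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data from the question, with an explicit element type
    val inputDF = Seq[(String, String, Option[Int], Option[String])](
      ("100", "John", Some(35), None),
      ("100", "John", None, Some("Georgia")),
      ("101", "Mike", Some(25), None),
      ("101", "Mike", None, Some("New York")),
      ("103", "Mary", Some(22), None),
      ("103", "Mary", None, Some("Texas")),
      ("104", "Smith", Some(25), None),
      ("105", "Jake", None, Some("Florida"))
    ).toDF("id", "name", "age", "city")

    inputDF.createOrReplaceTempView("people")

    // first(expr, true) skips nulls, like first(col, ignoreNulls = true)
    spark.sql(
      """SELECT id, name,
        |       first(age, true)  AS age,
        |       first(city, true) AS city
        |FROM people
        |GROUP BY id, name
        |ORDER BY id""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only
    val spark = SparkSession.builder().master("local[*]").appName("merge-rows").getOrCreate()
    merge(spark).show()
    spark.stop()
  }
}
```

Note that `first` and `last` depend on row order, which is not guaranteed after a shuffle. In the sample data each id has at most one non-null value per column, so the result is well defined; if a group could hold several non-null values, the one returned may vary between runs.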
