I need to merge rows in the same DataFrame based on a key column, "id". In the sample DataFrame, one row has data for id, name, and age; the other row has id, name, and city. Rows with the same "id" have to be merged into a single record in the final DataFrame. If an id has just one record, it should appear in the output as well, with null for the missing value (Smith and Jake in the example below).
The computation needs to happen on real-time data, so a solution based on Spark native functions would be ideal. I have tried filtering the records on the age and city columns into separate DataFrames and then performing a left join on id, but it's not very efficient. I'm looking for alternative suggestions. Thanks in advance!
Sample Dataframe
val inputDF = Seq(
  ("100", "John", Some(35), None),
  ("100", "John", None, Some("Georgia")),
  ("101", "Mike", Some(25), None),
  ("101", "Mike", None, Some("New York")),
  ("103", "Mary", Some(22), None),
  ("103", "Mary", None, Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake", None, Some("Florida"))
).toDF("id", "name", "age", "city")

Input Dataframe

+---+-----+----+--------+
|id |name |age |city    |
+---+-----+----+--------+
|100|John |35  |null    |
|100|John |null|Georgia |
|101|Mike |25  |null    |
|101|Mike |null|New York|
|103|Mary |22  |null    |
|103|Mary |null|Texas   |
|104|Smith|25  |null    |
|105|Jake |null|Florida |
+---+-----+----+--------+

Expected Output Dataframe

+---+-----+----+---------+
| id| name| age|     city|
+---+-----+----+---------+
|100| John|  35|  Georgia|
|101| Mike|  25| New York|
|103| Mary|  22|    Texas|
|104|Smith|  25|     null|
|105| Jake|null|  Florida|
+---+-----+----+---------+
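
One direction that might avoid the join is to group on the key and take the first non-null value of each remaining column. The block below is only a sketch of that idea, not a tested solution for the real pipeline. It assumes that name is constant per id and that each id has at most one non-null age and one non-null city; if there were several, which one `first` returns would not be guaranteed. The local `SparkSession` and the name `mergedDF` are there for illustration only, and the sketch does not address how this behaves on streaming input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-rows-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample data as above, with the element type spelled out.
val inputDF = Seq[(String, String, Option[Int], Option[String])](
  ("100", "John", Some(35), None),
  ("100", "John", None, Some("Georgia")),
  ("101", "Mike", Some(25), None),
  ("101", "Mike", None, Some("New York")),
  ("103", "Mary", Some(22), None),
  ("103", "Mary", None, Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake", None, Some("Florida"))
).toDF("id", "name", "age", "city")

// Assumption: at most one non-null age and city per id.
// first(..., ignoreNulls = true) skips nulls and yields null
// only when every value in the group is null.
val mergedDF = inputDF
  .groupBy("id", "name")
  .agg(
    first("age", ignoreNulls = true).as("age"),
    first("city", ignoreNulls = true).as("city")
  )
  .orderBy("id")

mergedDF.show()
```

Ids with a single row (104 and 105 here) keep null in the column that has no value, since every value in that group is null. The `orderBy` is only there to make the printed output easy to compare with the expected table and could be dropped.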