1

I have a dataset like below

+-------+--------+----------+----+----------+ |cytid |itemcode|percentile|mean|DTXN | +---------------+-----------+---+----------+ |800 | 7 | x4| y# |2/1/2017 | |700 | 13 | y#| oz |3/1/2017 | |800 | 24 | oz| oz |8/1/2017 | |900 | 747 | oz| IH |27/9/2017 | |700 | 798 | IH| yc |19/10/2017| |800 | 841 | yc| g5 |8/11/2017 | |800 | 867 | g5| IH |19/11/2017| |900 | 911 | IH| x4 |2/12/2017 | |800 | 945 | x4| ko |13/12/2017| |800 | 984 | ko| cd |27/12/2017| +-------+-------+----------+----+----------+ 

I need to produce an output like below

+-------+--------------------------------------+ |cytid | ipm | +----------------------------------------------+ |800 | x4,y#,oz,oz,yc,g5,g5,IH,x4,ko,ko,cd| |700 | y#,oz,IH,yc | |900 | oz,IH,IH,x4 | +-------+--------------------------------------+ 

To be precise i need to merge the columns itemcode, percentile, mean. So how to do this in scala

1 Answer 1

1

Given your input called data as below:

+-----+--------+----------+----+ |cytid|itemcode|percentile|mean| +-----+--------+----------+----+ |800 |7 |x4 |y# | |700 |13 |y# |oz | |800 |24 |oz |oz | |900 |747 |oz |IH | |700 |798 |IH |yc | |800 |841 |yc |g5 | |800 |867 |g5 |IH | |900 |911 |IH |x4 | |800 |945 |x4 |ko | |800 |984 |ko |cd | +-----+--------+----------+----+ 

You can achieve this output (a bit different from yours, because you did not take into account item_code) through:

data = data.groupBy("cytid") .agg(collect_set(array("itemcode", "percentile", "mean")).as("ipm") ) .withColumn("ipm", concat_ws(",", flatten(col("ipm")))) 

Good luck!

Sign up to request clarification or add additional context in comments.

1 Comment

thank you vilalabinot, it helped me a lot

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.