I'm using Spark with Java and I have a DataFrame like this:
+---+------+-----+
|id |color |datas|
+---+------+-----+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+---+------+-----+

I need to modify this DataFrame to look like this:
+---+-------------------+---------------------+
|id |color              |datas                |
+---+-------------------+---------------------+
|1  |[blue, red, orange]|[data1, data2, data3]|
|2  |[black, yellow]    |[data4, data5]       |
|3  |[white]            |[data7, data8]       |
+---+-------------------+---------------------+

I want to merge the data so that, for each column, the values from the different rows sharing the same 'id' are collected into an array.
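To make the expected result precise, here is a minimal plain-Java sketch of the grouping I'm after. It is not Spark code and not my current implementation; the class name `GroupSketch` and its methods are made up for illustration, and I'm assuming the blank cells above are null (or empty) and should be skipped, as the tables suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only (no Spark): shows the per-id grouping I expect.
class GroupSketch {

    // Sample rows as {id, color, datas}; blank cells are assumed to be null.
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"1", "blue", "data1"},
            new String[] {"1", "red", "data2"},
            new String[] {"1", "orange", "data3"},
            new String[] {"2", "black", "data4"},
            new String[] {"2", null, "data5"},
            new String[] {"2", "yellow", null},
            new String[] {"3", "white", "data7"},
            new String[] {"3", null, "data8"});
    }

    // Maps each id to [colors, datas], keeping row order and
    // skipping null or empty cells.
    static Map<String, List<List<String>>> collectById(List<String[]> rows) {
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<List<String>> cols = out.get(row[0]);
            if (cols == null) {
                cols = new ArrayList<>();
                cols.add(new ArrayList<String>()); // values of 'color'
                cols.add(new ArrayList<String>()); // values of 'datas'
                out.put(row[0], cols);
            }
            for (int i = 1; i < row.length; i++) {
                if (row[i] != null && !row[i].isEmpty()) {
                    cols.get(i - 1).add(row[i]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one line per id: id, colors, datas.
        for (Map.Entry<String, List<List<String>>> e
                : collectById(sampleRows()).entrySet()) {
            System.out.println(e.getKey() + " "
                + e.getValue().get(0) + " " + e.getValue().get(1));
        }
    }
}
```

All columns are collected in a single pass over the rows here, which is the behaviour I would like from Spark instead of aggregating one column at a time.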
I'm able to do it through a UserDefinedAggregateFunction, but I can only do it one column at a time and it takes too long to process.
Thank you
Edit: I'm using Spark 1.6