As others have suggested above, use collect_list().
    # Creating the DataFrame (spark is the SparkSession, the same one used for spark.sql() below)
    df = spark.createDataFrame(
        [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
         ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
         ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9),
         ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
         ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3),
         ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
         ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6)],
        ['id_1', 'id_2', 'id_3', 'timestamp', 'thing1', 'thing2', 'thing3'])
    df.show()
    +----+----+----+---------+------+------+------+
    |id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
    +----+----+----+---------+------+------+------+
    |   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
    |   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
    |   A|   b|   c|   time_2|   2.2|   2.6|   2.9|
    |   A|   b|   d|   time_0|   5.1|   5.5|   5.7|
    |   A|   b|   d|   time_1|   6.1|   6.2|   6.3|
    |   A|   b|   e|   time_0|   0.1|   0.5|   0.9|
    |   A|   b|   e|   time_1|   0.2|   0.3|   0.6|
    +----+----+----+---------+------+------+------+
Besides agg(), you can also use familiar SQL syntax. To do that, first register the DataFrame as a temporary SQL view:
    df.createOrReplaceTempView("df_view")
    df = spark.sql("""select id_1, id_2, id_3,
                             collect_list(timestamp) as timestamp,
                             collect_list(thing1) as thing1,
                             collect_list(thing2) as thing2,
                             collect_list(thing3) as thing3
                      from df_view
                      group by id_1, id_2, id_3""")
    df.show(truncate=False)
    +----+----+----+------------------------+---------------+---------------+---------------+
    |id_1|id_2|id_3|timestamp               |thing1         |thing2         |thing3         |
    +----+----+----+------------------------+---------------+---------------+---------------+
    |A   |b   |d   |[time_0, time_1]        |[5.1, 6.1]     |[5.5, 6.2]     |[5.7, 6.3]     |
    |A   |b   |e   |[time_0, time_1]        |[0.1, 0.2]     |[0.5, 0.3]     |[0.9, 0.6]     |
    |A   |b   |c   |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
    +----+----+----+------------------------+---------------+---------------+---------------+
Note: The triple quotes (""") are used so the statement can span several lines, purely for readability. A plain single-quoted string such as 'select id_1 ....' cannot be spread over multiple lines; Python raises a syntax error if you try. Either way, the query and its result are the same.
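To illustrate the note about quoting, here is a small plain-Python sketch (no Spark needed) of two ways to write a long query across several lines: a triple-quoted string, and adjacent string literals inside parentheses, which Python joins into one string. The shortened query and the variable names `query_triple` and `query_concat` are just for illustration.

```python
# Triple-quoted string: line breaks and indentation become part of the text,
# which is harmless for SQL.
query_triple = """select id_1, id_2, id_3,
       collect_list(thing1) as thing1
from df_view
group by id_1, id_2, id_3"""

# Adjacent string literals inside parentheses are concatenated by Python.
# Mind the trailing space at the end of each piece.
query_concat = (
    "select id_1, id_2, id_3, "
    "collect_list(thing1) as thing1 "
    "from df_view "
    "group by id_1, id_2, id_3"
)

# Apart from whitespace, both strings contain the same SQL tokens.
print(query_triple.split() == query_concat.split())
```

Either string can be passed to spark.sql() unchanged.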