
I have a pyspark dataframe, where I want to group by some index, and combine all the values in each column into one list per column.

Example input:

id_1|id_2|id_3|timestamp|thing1|thing2|thing3
 A  | b  | c  | time_0  | 1.2  | 1.3  | 2.5
 A  | b  | c  | time_1  | 1.1  | 1.5  | 3.4
 A  | b  | c  | time_2  | 2.2  | 2.6  | 2.9
 A  | b  | d  | time_0  | 5.1  | 5.5  | 5.7
 A  | b  | d  | time_1  | 6.1  | 6.2  | 6.3
 A  | b  | e  | time_0  | 0.1  | 0.5  | 0.9
 A  | b  | e  | time_1  | 0.2  | 0.3  | 0.6

Example output:

id_1|id_2|id_3|       timestamp      |    thing1   |    thing2   |    thing3
 A  | b  | c  |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6]|[2.5,3.4,2.9]
 A  | b  | d  |[time_0,time_1]       |[5.1,6.1]    |[5.5,6.2]    |[5.7,6.3]
 A  | b  | e  |[time_0,time_1]       |[0.1,0.2]    |[0.5,0.3]    |[0.9,0.6]

How can I do this efficiently?

  • What have you tried? Take a look at the collect_list function and the agg function. – Commented Apr 3, 2019 at 22:50

2 Answers


Use collect_list(), as suggested in the comments above.

# Creating the DataFrame
df = sqlContext.createDataFrame(
    [('A','b','c','time_0',1.2,1.3,2.5), ('A','b','c','time_1',1.1,1.5,3.4),
     ('A','b','c','time_2',2.2,2.6,2.9), ('A','b','d','time_0',5.1,5.5,5.7),
     ('A','b','d','time_1',6.1,6.2,6.3), ('A','b','e','time_0',0.1,0.5,0.9),
     ('A','b','e','time_1',0.2,0.3,0.6)],
    ['id_1','id_2','id_3','timestamp','thing1','thing2','thing3'])
df.show()

+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
|   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
|   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
|   A|   b|   c|   time_2|   2.2|   2.6|   2.9|
|   A|   b|   d|   time_0|   5.1|   5.5|   5.7|
|   A|   b|   d|   time_1|   6.1|   6.2|   6.3|
|   A|   b|   e|   time_0|   0.1|   0.5|   0.9|
|   A|   b|   e|   time_1|   0.2|   0.3|   0.6|
+----+----+----+---------+------+------+------+

In addition to using agg(), you can use familiar SQL syntax to operate on it, but first we have to register the DataFrame as a temporary SQL view -

df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
                         collect_list(timestamp) as timestamp,
                         collect_list(thing1) as thing1,
                         collect_list(thing2) as thing2,
                         collect_list(thing3) as thing3
                  from df_view
                  group by id_1, id_2, id_3""")
df.show(truncate=False)

+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp               |thing1         |thing2         |thing3         |
+----+----+----+------------------------+---------------+---------------+---------------+
|A   |b   |d   |[time_0, time_1]        |[5.1, 6.1]     |[5.5, 6.2]     |[5.7, 6.3]     |
|A   |b   |e   |[time_0, time_1]        |[0.1, 0.2]     |[0.5, 0.3]     |[0.9, 0.6]     |
|A   |b   |c   |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+

Note: Triple quotes (""") are used so that the SQL statement can span multiple lines, purely for readability. A simple single-quoted string like 'select id_1 ....' could not be spread over multiple lines that way. Needless to say, the final result is the same either way.
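For intuition, the same group-and-collect semantics can be sketched in plain Python with only the standard library, using the question's example input. The helper name collect_lists is hypothetical, not a Spark API. One caveat worth noting: unlike this sequential sketch, Spark's collect_list does not guarantee element order within each list unless you sort explicitly, since rows arrive from partitions in nondeterministic order.

```python
# Plain-Python sketch of groupBy + collect_list, assuming the question's data.
from collections import defaultdict

rows = [
    ('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
    ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
    ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9),
    ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
    ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3),
    ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
    ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6),
]

def collect_lists(rows):
    """Group by (id_1, id_2, id_3); collect each remaining column into a list."""
    grouped = defaultdict(lambda: [[] for _ in range(4)])
    for id_1, id_2, id_3, ts, t1, t2, t3 in rows:
        bucket = grouped[(id_1, id_2, id_3)]
        for lst, value in zip(bucket, (ts, t1, t2, t3)):
            lst.append(value)
    return dict(grouped)

result = collect_lists(rows)
# result[('A', 'b', 'c')] -> [['time_0', 'time_1', 'time_2'],
#                             [1.2, 1.1, 2.2], [1.3, 1.5, 2.6], [2.5, 3.4, 2.9]]
```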




Here is the example on GitHub: TestExample1

from pyspark.sql.functions import col, collect_list

exampleDf = spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4)],
    ("id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"))
exampleDf.show()

ans = exampleDf.groupBy(col("id_1"), col("id_2"), col("id_3")) \
    .agg(collect_list(col("timestamp")),
         collect_list(col("thing1")),
         collect_list(col("thing2")))
ans.show()

+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
|   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
|   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
+----+----+----+---------+------+------+------+

+----+----+----+-----------------------+--------------------+--------------------+
|id_1|id_2|id_3|collect_list(timestamp)|collect_list(thing1)|collect_list(thing2)|
+----+----+----+-----------------------+--------------------+--------------------+
|   A|   b|   c|       [time_0, time_1]|          [1.2, 1.1]|          [1.3, 1.5]|
+----+----+----+-----------------------+--------------------+--------------------+

