
I have a data frame in PySpark like below:

```python
df = spark.createDataFrame(
    [('123', '2021-01-01', 1815, 9876), ('123', '2021-01-01', 1820, 9877),
     ('123', '2021-01-01', 1828, 9878), ('123', '2021-02-01', 1815, 9876),
     ('123', '2021-02-01', 1820, 9877), ('123', '2021-02-01', 1828, 9878),
     ('223', '2021-01-01', 1815, 9876), ('223', '2021-01-01', 1820, 9877),
     ('223', '2021-01-01', 1828, 9878), ('223', '2021-02-01', 1815, 9876),
     ('223', '2021-02-01', 1820, 9877), ('223', '2021-02-01', 1828, 9878)],
    ['number', 'date', 'sorter', 'key'])
df.show()
+------+----------+------+----+
|number|      date|sorter| key|
+------+----------+------+----+
|   123|2021-01-01|  1815|9876|
|   123|2021-01-01|  1820|9877|
|   123|2021-01-01|  1828|9878|
|   123|2021-02-01|  1815|9876|
|   123|2021-02-01|  1820|9877|
|   123|2021-02-01|  1828|9878|
|   223|2021-01-01|  1815|9876|
|   223|2021-01-01|  1820|9877|
|   223|2021-01-01|  1828|9878|
|   223|2021-02-01|  1815|9876|
|   223|2021-02-01|  1820|9877|
|   223|2021-02-01|  1828|9878|
+------+----------+------+----+
```

This data frame is sorted based on the `sorter` column.

Now, using the above data frame, I want to create a new data frame based on the rules below:

1) For each group where `number` and `date` are the same, I want to concatenate the `key` values.
2) In each group, the first record has its own `key` as `joined_key`.
3) From the second record onwards, each record should have its own `key` followed by the `joined_key` of the previous record.

Expected result:

```python
df1.show()
+------+----------+------+----+---------------+
|number|      date|sorter| key|     Joined_key|
+------+----------+------+----+---------------+
|   123|2021-01-01|  1815|9876|           9876|
|   123|2021-01-01|  1820|9877|      9877~9876|
|   123|2021-01-01|  1828|9878| 9878~9877~9876|
|   123|2021-02-01|  1815|9876|           9876|
|   123|2021-02-01|  1820|9877|      9877~9876|
|   123|2021-02-01|  1828|9878| 9878~9877~9876|
|   223|2021-01-01|  1815|9876|           9876|
|   223|2021-01-01|  1820|9877|      9877~9876|
|   223|2021-01-01|  1828|9878| 9878~9877~9876|
|   223|2021-02-01|  1815|9876|           9876|
|   223|2021-02-01|  1820|9877|      9877~9876|
|   223|2021-02-01|  1828|9878| 9878~9877~9876|
+------+----------+------+----+---------------+
```

I have tried the following, but I am unable to proceed further:

```python
df1 = df.groupby("number", "date").agg(collect_list('key').alias('joined_key'))
df1.show()
+------+----------+------------------+
|number|      date|        joined_key|
+------+----------+------------------+
|   223|2021-02-01|[9878, 9876, 9877]|
|   123|2021-01-01|[9878, 9876, 9877]|
|   223|2021-01-01|[9878, 9876, 9877]|
|   123|2021-02-01|[9876, 9877, 9878]|
+------+----------+------------------+
```

How can I achieve what I want?

1 Answer

You can use a window function with some aggregation, as below:

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import array_join, collect_list, reverse

window = Window.partitionBy("number", "date").orderBy("sorter")
df.withColumn("Joined_key",
              array_join(reverse(collect_list("key").over(window)), "~")) \
  .show(truncate=False)
```

Result:

```
+------+----------+------+----+--------------+
|number|date      |sorter|key |Joined_key    |
+------+----------+------+----+--------------+
|223   |2021-02-01|1815  |9876|9876          |
|223   |2021-02-01|1820  |9877|9877~9876     |
|223   |2021-02-01|1828  |9878|9878~9877~9876|
|123   |2021-01-01|1815  |9876|9876          |
|123   |2021-01-01|1820  |9877|9877~9876     |
|123   |2021-01-01|1828  |9878|9878~9877~9876|
|223   |2021-01-01|1815  |9876|9876          |
|223   |2021-01-01|1820  |9877|9877~9876     |
|223   |2021-01-01|1828  |9878|9878~9877~9876|
|123   |2021-02-01|1815  |9876|9876          |
|123   |2021-02-01|1820  |9877|9877~9876     |
|123   |2021-02-01|1828  |9878|9878~9877~9876|
+------+----------+------+----+--------------+
```
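As a cross-check of the expected output, the cumulative concatenation that the window performs (running `key~previous_joined_key` per `number`/`date` group, ordered by `sorter`) can be sketched in plain Python without Spark. `joined_keys` is just an illustrative helper, not part of any API:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ('123', '2021-01-01', 1815, 9876), ('123', '2021-01-01', 1820, 9877),
    ('123', '2021-01-01', 1828, 9878), ('123', '2021-02-01', 1815, 9876),
    ('123', '2021-02-01', 1820, 9877), ('123', '2021-02-01', 1828, 9878),
]

def joined_keys(rows):
    """Per (number, date) group, ordered by sorter, prepend each key to the
    previous row's joined key."""
    out = []
    for _, grp in groupby(sorted(rows, key=itemgetter(0, 1, 2)),
                          key=itemgetter(0, 1)):
        prev = None
        for number, date, sorter, key in grp:
            joined = str(key) if prev is None else f"{key}~{prev}"
            out.append((number, date, sorter, key, joined))
            prev = joined
    return out

for row in joined_keys(rows):
    print(row)
# e.g. the third row of each group ends with '9878~9877~9876'
```

This mirrors `collect_list("key").over(window)` (the running list), `reverse` (newest key first), and `array_join(..., "~")` (the string join).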