I have a data frame in PySpark like the one below:

    df = spark.createDataFrame(
        [
            ('123', '2021-01-01', 1815, 9876),
            ('123', '2021-01-01', 1820, 9877),
            ('123', '2021-01-01', 1828, 9878),
            ('123', '2021-02-01', 1815, 9876),
            ('123', '2021-02-01', 1820, 9877),
            ('123', '2021-02-01', 1828, 9878),
            ('223', '2021-01-01', 1815, 9876),
            ('223', '2021-01-01', 1820, 9877),
            ('223', '2021-01-01', 1828, 9878),
            ('223', '2021-02-01', 1815, 9876),
            ('223', '2021-02-01', 1820, 9877),
            ('223', '2021-02-01', 1828, 9878),
        ],
        ['number', 'date', 'sorter', 'key'],
    )

    df.show()

    +------+----------+------+----+
    |number|      date|sorter| key|
    +------+----------+------+----+
    |   123|2021-01-01|  1815|9876|
    |   123|2021-01-01|  1820|9877|
    |   123|2021-01-01|  1828|9878|
    |   123|2021-02-01|  1815|9876|
    |   123|2021-02-01|  1820|9877|
    |   123|2021-02-01|  1828|9878|
    |   223|2021-01-01|  1815|9876|
    |   223|2021-01-01|  1820|9877|
    |   223|2021-01-01|  1828|9878|
    |   223|2021-02-01|  1815|9876|
    |   223|2021-02-01|  1820|9877|
    |   223|2021-02-01|  1828|9878|
    +------+----------+------+----+

This data frame is sorted on the `sorter` column.

Now, using the above data frame, I want to create a new data frame based on the rules below:

1) For each group where `number` and `date` are the same, I want to concatenate the `key` values.
2) In each group, the first record will have its own `key` as `joined_key`.
3) From the second record onwards, each record should have its own `key` followed by the `joined_key` of the previous record.

Expected result:

    df1.show()

    +------+----------+------+----+---------------+
    |number|      date|sorter| key|     Joined_key|
    +------+----------+------+----+---------------+
    |   123|2021-01-01|  1815|9876|           9876|
    |   123|2021-01-01|  1820|9877|      9877~9876|
    |   123|2021-01-01|  1828|9878| 9878~9877~9876|
    |   123|2021-02-01|  1815|9876|           9876|
    |   123|2021-02-01|  1820|9877|      9877~9876|
    |   123|2021-02-01|  1828|9878| 9878~9877~9876|
    |   223|2021-01-01|  1815|9876|           9876|
    |   223|2021-01-01|  1820|9877|      9877~9876|
    |   223|2021-01-01|  1828|9878| 9878~9877~9876|
    |   223|2021-02-01|  1815|9876|           9876|
    |   223|2021-02-01|  1820|9877|      9877~9876|
    |   223|2021-02-01|  1828|9878| 9878~9877~9876|
    +------+----------+------+----+---------------+

I have tried the following, but I am unable to proceed further:

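    from pyspark.sql.functions import collect_list

    # Collapses each (number, date) group into a single row holding all keys
    df1 = df.groupby("number", "date").agg(collect_list('key').alias('joined_key'))

    df1.show()

    +------+----------+------------------+
    |number|      date|        joined_key|
    +------+----------+------------------+
    |   223|2021-02-01|[9878, 9876, 9877]|
    |   123|2021-01-01|[9878, 9876, 9877]|
    |   223|2021-01-01|[9878, 9876, 9877]|
    |   123|2021-02-01|[9876, 9877, 9878]|
    +------+----------+------------------+

This gives one row per group, and the order of the collected keys is not guaranteed, whereas I need one output row per input row. How can I achieve what I want?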
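
To make the three rules unambiguous, here is a minimal plain-Python sketch of the logic (not PySpark). The function name `add_joined_key` and the `sep` parameter are hypothetical names made up for illustration, and the rows are assumed to be plain tuples in the same column order as the data frame.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as the data frame in the question: (number, date, sorter, key).
rows = [
    ("123", "2021-01-01", 1815, 9876),
    ("123", "2021-01-01", 1820, 9877),
    ("123", "2021-01-01", 1828, 9878),
    ("123", "2021-02-01", 1815, 9876),
    ("123", "2021-02-01", 1820, 9877),
    ("123", "2021-02-01", 1828, 9878),
    ("223", "2021-01-01", 1815, 9876),
    ("223", "2021-01-01", 1820, 9877),
    ("223", "2021-01-01", 1828, 9878),
    ("223", "2021-02-01", 1815, 9876),
    ("223", "2021-02-01", 1820, 9877),
    ("223", "2021-02-01", 1828, 9878),
]


def add_joined_key(rows, sep="~"):
    """Append a running, newest-first join of `key` within each (number, date) group."""
    # Sort by group columns, then by sorter, so each group is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0, 1)):
        joined = None  # joined_key of the previous record in this group
        for number, date, sorter, key in group:
            # Rule 2: the first record keeps its own key.
            # Rule 3: later records prepend their key to the previous joined_key.
            joined = str(key) if joined is None else f"{key}{sep}{joined}"
            result.append((number, date, sorter, key, joined))
    return result


for row in add_joined_key(rows):
    print(row)
```

In PySpark, a window partitioned by `number` and `date` and ordered by `sorter` may be one way to express the same running join per row; that is an assumption, not something tested here.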