
I have a data frame in PySpark like below:

```python
df = spark.createDataFrame(
    [('123', '2021-01-01', 1815, 9876), ('123', '2021-01-01', 1820, 9877),
     ('123', '2021-01-01', 1828, 9878), ('123', '2021-02-01', 1815, 9876),
     ('123', '2021-02-01', 1820, 9877), ('123', '2021-02-01', 1828, 9878),
     ('223', '2021-01-01', 1815, 9876), ('223', '2021-01-01', 1820, 9877),
     ('223', '2021-01-01', 1828, 9878), ('223', '2021-02-01', 1815, 9876),
     ('223', '2021-02-01', 1820, 9877), ('223', '2021-02-01', 1828, 9878)],
    ['number', 'date', 'sorter', 'key'])
df.show()
+------+----------+------+----+
|number|      date|sorter| key|
+------+----------+------+----+
|   123|2021-01-01|  1815|9876|
|   123|2021-01-01|  1820|9877|
|   123|2021-01-01|  1828|9878|
|   123|2021-02-01|  1815|9876|
|   123|2021-02-01|  1820|9877|
|   123|2021-02-01|  1828|9878|
|   223|2021-01-01|  1815|9876|
|   223|2021-01-01|  1820|9877|
|   223|2021-01-01|  1828|9878|
|   223|2021-02-01|  1815|9876|
|   223|2021-02-01|  1820|9877|
|   223|2021-02-01|  1828|9878|
+------+----------+------+----+
```

This data frame is sorted based on the `sorter` column.

Now, using the above data frame, I want to create a new data frame based on the rules below:

1) For each group where `number` and `date` are the same, I want to concatenate the `key` values.
2) In each group, the first record has its own `key` as `joined_key`.
3) From the second record onwards, each record should have its own `key` followed by the `joined_key` of the previous record.

Expected result:

```python
df1.show()
+------+----------+------+----+---------------+
|number|      date|sorter| key|     Joined_key|
+------+----------+------+----+---------------+
|   123|2021-01-01|  1815|9876|           9876|
|   123|2021-01-01|  1820|9877|      9877~9876|
|   123|2021-01-01|  1828|9878| 9878~9877~9876|
|   123|2021-02-01|  1815|9876|           9876|
|   123|2021-02-01|  1820|9877|      9877~9876|
|   123|2021-02-01|  1828|9878| 9878~9877~9876|
|   223|2021-01-01|  1815|9876|           9876|
|   223|2021-01-01|  1820|9877|      9877~9876|
|   223|2021-01-01|  1828|9878| 9878~9877~9876|
|   223|2021-02-01|  1815|9876|           9876|
|   223|2021-02-01|  1820|9877|      9877~9876|
|   223|2021-02-01|  1828|9878| 9878~9877~9876|
+------+----------+------+----+---------------+
```

I have tried the following, but I am unable to proceed further:

```python
df1 = df.groupby("number", "date").agg(collect_list('key').alias('joined_key'))
df1.show()
+------+----------+------------------+
|number|      date|        joined_key|
+------+----------+------------------+
|   223|2021-02-01|[9878, 9876, 9877]|
|   123|2021-01-01|[9878, 9876, 9877]|
|   223|2021-01-01|[9878, 9876, 9877]|
|   123|2021-02-01|[9876, 9877, 9878]|
+------+----------+------------------+
```

How can I achieve what I want?

1 Answer

You can use a window function with some aggregation, as below:

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import array_join, collect_list, reverse

window = Window.partitionBy("number", "date").orderBy("sorter")
df.withColumn("Joined_key",
              array_join(reverse(collect_list("key").over(window)), "~")) \
  .show(truncate=False)
```

Result:

```
+------+----------+------+----+--------------+
|number|date      |sorter|key |Joined_key    |
+------+----------+------+----+--------------+
|223   |2021-02-01|1815  |9876|9876          |
|223   |2021-02-01|1820  |9877|9877~9876     |
|223   |2021-02-01|1828  |9878|9878~9877~9876|
|123   |2021-01-01|1815  |9876|9876          |
|123   |2021-01-01|1820  |9877|9877~9876     |
|123   |2021-01-01|1828  |9878|9878~9877~9876|
|223   |2021-01-01|1815  |9876|9876          |
|223   |2021-01-01|1820  |9877|9877~9876     |
|223   |2021-01-01|1828  |9878|9878~9877~9876|
|123   |2021-02-01|1815  |9876|9876          |
|123   |2021-02-01|1820  |9877|9877~9876     |
|123   |2021-02-01|1828  |9878|9878~9877~9876|
+------+----------+------+----+--------------+
```
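As a cross-check of the expected output, the cumulative concatenation that the window performs (running `key~previous_joined_key` per `number`/`date` group, ordered by `sorter`) can be sketched in plain Python without Spark. `joined_keys` is just an illustrative helper, not part of any API:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ('123', '2021-01-01', 1815, 9876), ('123', '2021-01-01', 1820, 9877),
    ('123', '2021-01-01', 1828, 9878), ('123', '2021-02-01', 1815, 9876),
    ('123', '2021-02-01', 1820, 9877), ('123', '2021-02-01', 1828, 9878),
]

def joined_keys(rows):
    """Per (number, date) group, ordered by sorter, prepend each key to the
    previous row's joined key."""
    out = []
    for _, grp in groupby(sorted(rows, key=itemgetter(0, 1, 2)),
                          key=itemgetter(0, 1)):
        prev = None
        for number, date, sorter, key in grp:
            joined = str(key) if prev is None else f"{key}~{prev}"
            out.append((number, date, sorter, key, joined))
            prev = joined
    return out

for row in joined_keys(rows):
    print(row)
# e.g. the third row of each group ends with '9878~9877~9876'
```

This mirrors `collect_list("key").over(window)` (the running list), `reverse` (newest key first), and `array_join(..., "~")` (the string join).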