I have a dataset that looks something like the one below.
import datetime

df = spark.createDataFrame(
    [
        ("001A", 105, "foo", datetime.date(2022, 1, 1)),
        ("001A", 25, "foo", datetime.date(2022, 1, 1)),
        ("002B", 85, "bar", datetime.date(2022, 1, 15)),
        ("002B", 15, "bar", datetime.date(2022, 1, 15)),
    ],
    ["id", "num_col1", "str_col1", "date_col1"],
)
df.show()

+----+--------+--------+----------+
|  id|num_col1|str_col1| date_col1|
+----+--------+--------+----------+
|001A|     105|     foo|2022-01-01|
|001A|      25|     foo|2022-01-01|
|002B|      85|     bar|2022-01-15|
|002B|      15|     bar|2022-01-15|
+----+--------+--------+----------+

What I want to achieve is an aggregated form of df: group by the values in id, then aggregate across all the remaining columns in the dataframe. The resulting dataframe would look like this:
+----+--------+--------+----------+
|  id|num_col1|str_col1| date_col1|
+----+--------+--------+----------+
|001A|     130|     foo|2022-01-01|
|002B|     100|     bar|2022-01-15|
+----+--------+--------+----------+

The dataframe contains a mixture of:
- numeric columns - which need to be summed within each group
- string columns - which are always identical within a group - so just take the existing value
- date columns - which are also always identical within a group - so just take the existing value
The actual dataframe contains many, many more columns, so any method that involves writing out every single column by hand, like the sketch below, will not work.
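For just the four example columns, a hand-written version would look roughly like this (using pyspark.sql.functions, imported as F):

from pyspark.sql import functions as F

df.groupBy("id").agg(
    F.sum("num_col1").alias("num_col1"),      # numeric column: sum within the group
    F.first("str_col1").alias("str_col1"),    # string column: identical within the group, so take any one value
    F.first("date_col1").alias("date_col1"),  # date column: identical within the group, so take any one value
).show()

But that is exactly the kind of per-column listing I am trying to avoid.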
I have searched fairly comprehensively online, but I have not found any similar questions or solutions that I could adapt to work on my data.
I am quite new to PySpark, so my attempts have been pretty futile. I have tried to use the collect_set function to collapse each group down into a single row, with the intention of then applying a map function as in Merge multiple spark rows to one, but it has been unsuccessful.
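Roughly, that attempt looked something like this (collect_set applied to every non-id column, with the idea of unpacking the resulting arrays afterwards):

from pyspark.sql import functions as F

# collapse each group to one row, collecting every non-id column into a set
df.groupBy("id").agg(
    *[F.collect_set(c).alias(c) for c in df.columns if c != "id"]
).show(truncate=False)

But that just leaves me with array columns (e.g. [105, 25] for num_col1) rather than the summed and flattened values I am after.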