
I want to make a user-defined aggregate function (UDAF) in PySpark. I found some documentation for Scala and would like to achieve something similar in Python.

To be more specific, assume I already have a function like this implemented:

def process_data(df: pyspark.sql.DataFrame) -> bytes:
    ...  # do something very complicated here

and now I would like to be able to do something like:

source_df.groupBy("Foo_ID").agg(UDAF(process_data)) 

Now the question is: what should I put in place of UDAF?


1 Answer


PySpark does not support defining UDAFs directly, so the aggregation has to be done manually, for example by collecting each group's rows and applying an ordinary UDF, or by using a pandas grouped-map function; sketches of both approaches follow.
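Here is a minimal sketch of the first approach: collect each group's rows into a list with collect_list and hand it to an ordinary UDF. It assumes the source_df and Foo_ID column from the question; process_rows is a hypothetical stand-in for process_data, since a UDF receives the group as a list of Row objects rather than a DataFrame:

    from pyspark.sql import functions as F
    from pyspark.sql.types import BinaryType

    # Hypothetical stand-in for process_data: receives one group's rows
    # as a list of Row objects instead of a DataFrame.
    @F.udf(returnType=BinaryType())
    def process_rows(rows):
        return b"".join(str(r).encode("utf-8") for r in rows)

    result_df = (
        source_df.groupBy("Foo_ID")
        .agg(F.collect_list(F.struct(*source_df.columns)).alias("rows"))
        .withColumn("result", process_rows("rows"))
        .drop("rows")
    )

Note that collect_list materializes an entire group in memory on one executor, so this only works when each group is reasonably small.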

References:

  1. https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html

  2. How to write Pyspark UDAF on multiple columns?

  3. Applying UDFs on GroupedData in PySpark (with functioning python example)

  4. https://florianwilhelm.info/2017/10/efficient_udfs_with_pyspark/
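References 3 and 4 describe the pandas route; on Spark 3.0+ the same idea is available as applyInPandas on grouped data. A minimal sketch, assuming Foo_ID is a string column and that process_data can be adapted to accept a pandas DataFrame instead of a pyspark.sql.DataFrame:

    import pandas as pd

    def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds all rows of one Foo_ID group as a pandas DataFrame
        return pd.DataFrame({
            "Foo_ID": [pdf["Foo_ID"].iloc[0]],
            "result": [process_data(pdf)],  # process_data adapted to pandas input
        })

    result_df = source_df.groupBy("Foo_ID").applyInPandas(
        process_group, schema="Foo_ID string, result binary"
    )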
