
I want to make a user-defined aggregate function (UDAF) in PySpark. I found some documentation for Scala and would like to achieve something similar in Python.

To be more specific, assume I already have a function like this implemented:

def process_data(df: pyspark.sql.DataFrame) -> bytes:
    ...  # do something very complicated here

and now I would like to be able to do something like:

source_df.groupBy("Foo_ID").agg(UDAF(process_data)) 

Now the question is: what should I put in place of UDAF?


1 Answer


PySpark does not support defining UDAFs directly, so the aggregation has to be done manually, for example by collecting each group's rows and applying an ordinary UDF, or by using a pandas grouped-map function; sketches of both approaches follow.
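Here is a minimal sketch of the first approach: collect each group's rows into a list with collect_list and hand it to an ordinary UDF. It assumes the source_df and Foo_ID column from the question; process_rows is a hypothetical stand-in for process_data, since a UDF receives the group as a list of Row objects rather than a DataFrame:

    from pyspark.sql import functions as F
    from pyspark.sql.types import BinaryType

    # Hypothetical stand-in for process_data: receives one group's rows
    # as a list of Row objects instead of a DataFrame.
    @F.udf(returnType=BinaryType())
    def process_rows(rows):
        return b"".join(str(r).encode("utf-8") for r in rows)

    result_df = (
        source_df.groupBy("Foo_ID")
        .agg(F.collect_list(F.struct(*source_df.columns)).alias("rows"))
        .withColumn("result", process_rows("rows"))
        .drop("rows")
    )

Note that collect_list materializes an entire group in memory on one executor, so this only works when each group is reasonably small.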

References:

  1. https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html

  2. How to write Pyspark UDAF on multiple columns?

  3. Applying UDFs on GroupedData in PySpark (with functioning python example)

  4. https://florianwilhelm.info/2017/10/efficient_udfs_with_pyspark/
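References 3 and 4 describe the pandas route; on Spark 3.0+ the same idea is available as applyInPandas on grouped data. A minimal sketch, assuming Foo_ID is a string column and that process_data can be adapted to accept a pandas DataFrame instead of a pyspark.sql.DataFrame:

    import pandas as pd

    def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds all rows of one Foo_ID group as a pandas DataFrame
        return pd.DataFrame({
            "Foo_ID": [pdf["Foo_ID"].iloc[0]],
            "result": [process_data(pdf)],  # process_data adapted to pandas input
        })

    result_df = source_df.groupBy("Foo_ID").applyInPandas(
        process_group, schema="Foo_ID string, result binary"
    )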
