
Is there a way to select the entire row as a column to input into a PySpark filter UDF?

I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation.

I know that I can convert the DataFrame to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a DataFrame. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD back into a DataFrame.

1 Answer


You should write out all of the columns explicitly. For example:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# create sample df
df = sc.parallelize([
    (1, 'b'),
    (1, 'c'),
]).toDF(["id", "category"])

# simple filter function
@F.udf(returnType=BooleanType())
def my_filter(col1, col2):
    return (col1 > 0) & (col2 == "b")

df.filter(my_filter('id', 'category')).show()

Results:

+---+--------+
| id|category|
+---+--------+
|  1|       b|
+---+--------+

If you have many columns and you are sure of the column order:

cols = df.columns
df.filter(my_filter(*cols)).show()

Yields the same output.
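Note that this only works if the UDF's signature matches the number of columns passed in. As a minimal sketch of my own (not part of the original answer), a UDF declared with *args can accept however many columns the DataFrame has; the predicate below is just a placeholder:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Hypothetical variadic filter: receives every column value of a row
# as positional arguments, so it works for any number of columns.
@F.udf(returnType=BooleanType())
def any_col_filter(*values):
    # example predicate: keep rows where no column is NULL
    return all(v is not None for v in values)

df.filter(any_col_filter(*df.columns)).show()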


2 Comments

Thank you for this clean solution; however, we cannot do this if we have many columns. I am operating on a DataFrame with 100 columns. Could you please help with that case?
@pnv you have to iterate over your schema and add the columns to the call
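For the many-columns case raised in the comments, a common workaround (a sketch under my own assumptions, not from the answer above) is to pack the whole row into a single struct column and pass that one column to the UDF; the UDF then receives a Row-like object, which also keeps nested fields intact. The predicate my_row_filter below is hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Hypothetical row-level predicate: the struct column arrives as a
# Row-like object, so fields can be accessed by name.
@F.udf(returnType=BooleanType())
def my_row_filter(row):
    return row["id"] > 0 and row["category"] == "b"

# struct(*df.columns) bundles every column into one struct column,
# effectively passing the entire row to the UDF.
df.filter(my_row_filter(F.struct(*df.columns))).show()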
