
I have a Spark DataFrame and a filter string to apply. The filter only selects some rows, but I would also like to know the reason why the remaining rows were not selected. Example:

DataFrame columns: customer_id|col_a|col_b|col_c|col_d

Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0

[screenshot: the expected output, the same DataFrame with an added reason_for_exclusion column]

etc...

reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded.

I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so I'm just checking whether there is a better way to do this operation.

Thanks


You'll have to check each condition within the filter expression, which can be expensive compared to the simple operation of filtering. I would suggest displaying the same reason for all filtered-out rows, since each of them fails at least one condition in that expression. It's not pretty, but I'd prefer this approach as it's efficient, especially when you have to handle very large DataFrames.

```python
from pyspark.sql.functions import expr, lit, when

data = [
    (1, 1, 5, -3, 0),
    (2, 0, 10, -1, 0),
    (3, 0, 10, -4, 1),
]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

filtered_df = df.withColumn(
    "reason_for_exclusion",
    when(~expr(filter_expr), lit(filter_expr)).otherwise(lit(None)),
)
filtered_df.show(truncate=False)
```

Output:

```
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion                             |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1          |1    |5    |-3   |0    |null                                             |
|2          |0    |10   |-1   |0    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3          |0    |10   |-4   |1    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
```

EDIT:

Now, if you really want to display only the conditions that failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. Then you'll have to check the columns that evaluated to False to know which condition failed.

You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:

```python
from pyspark import StorageLevel
from pyspark.sql.functions import array, array_except, col, expr, lit, when

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]

# create a column for each condition in the filter expression
condition_columns = [
    expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")
]

# evaluate each condition to True/False and persist the DF with the calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)

# get back the columns we calculated for the filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
    filter_columns.append(
        when(~col(f"`{c}`"), lit(c.replace(COLUMN_FILTER_PREFIX, "")))
    )

# collect the failed conditions into an array, dropping the nulls (passed conditions)
array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)

df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
```

Output:

```
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion  |
+-----------+-----+-----+-----+-----+----------------------+
|1          |1    |5    |-3   |0    |[]                    |
|2          |0    |10   |-1   |0    |[col_a > 0 ]          |
|3          |0    |10   |-4   |1    |[col_a > 0 , col_d=0] |
+-----------+-----+-----+-----+-----+----------------------+
```
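One detail worth noting: `filter_expr.split("AND")` keeps the surrounding whitespace in each fragment, which is why the reasons in the output render as `col_a > 0 ` with a trailing space. Stripping the fragments before aliasing gives cleaner reason strings (plain Python, no Spark needed; this naive split would still break if a column name or string literal contained "AND"):

```python
filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

# strip each fragment so the reported reasons carry no stray whitespace
conditions = [c.strip() for c in filter_expr.split("AND")]
print(conditions)  # → ['col_a > 0', 'col_b > 4', 'col_c < 0', 'col_d=0']
```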

Comments:

I understand but this does not tell which filter condition failed.
@SparkScalaDeveloper Yeah, you'll have to check the entire line. Generally, we save the filtered-out rows to another file; you could read it and find out the reasons easily. Anyway, I added a full example of what you're looking for, see my update.
