
I want to replace null values in a DataFrame, but only in rows that match a specific criterion.

I have this DataFrame:

A|B   |C   |D   |
1|null|null|null|
2|null|null|null|
2|null|null|null|
2|null|null|null|
5|null|null|null|

I want to do this:

A|B   |C   |D   |
1|null|null|null|
2|x   |x   |x   |
2|x   |x   |x   |
2|x   |x   |x   |
5|null|null|null|

My case

So the nulls in every row that has the number 2 in column A should be replaced.

The columns other than A are dynamic: their number and names will change.

I also want to be able to select all the rows, not only the replaced ones.

What I tried

I tried df.where and fillna, but that does not keep all the rows.

I also thought about using withColumn, but I only know column A; all the others change on each execution.

Adapted Solution:

    df.select(
        "A",
        *[
            when(
                col("A") == '2',
                coalesce(col(c), lit('0').cast(df.schema[c].dataType))
            ).otherwise(col(c)).alias(c)
            for c in cols_to_replace
        ]
    )

1 Answer


Use pyspark.sql.functions.when with pyspark.sql.functions.coalesce:

    from pyspark.sql.functions import coalesce, col, lit, when

    cols_to_replace = df.columns[1:]

    df.select(
        "A",
        *[
            when(col("A") == 2, coalesce(col(c), lit("x"))).otherwise(col(c)).alias(c)
            for c in cols_to_replace
        ]
    ).show()
    #+---+----+----+----+
    #|  A|   B|   C|   D|
    #+---+----+----+----+
    #|  1|null|null|null|
    #|  2|   x|   x|   x|
    #|  2|   x|   x|   x|
    #|  2|   x|   x|   x|
    #|  5|null|null|null|
    #+---+----+----+----+

Inside the list comprehension, you check to see if the value of A is 2. If yes, then you coalesce the value of the column and the literal x. This will replace nulls with x. Otherwise, keep the same column value.
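To make the per-cell rule concrete without a Spark session, here is a plain-Python sketch of the same logic. It is only an illustration, not PySpark code: rows are modeled as dicts with None standing in for null, and `coalesce` and `fill_matching_rows` are hypothetical helpers written for this example.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def fill_matching_rows(rows, key, match, fill):
    """Fill nulls with `fill`, but only in rows where row[key] == match."""
    result = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name != key and row[key] == match:
                # Mirrors the when(...) branch: keep the value unless it is null.
                new_row[name] = coalesce(value, fill)
            else:
                # Mirrors the otherwise(...) branch: pass the value through.
                new_row[name] = value
        result.append(new_row)
    return result


rows = [
    {"A": 1, "B": None, "C": None, "D": None},
    {"A": 2, "B": None, "C": "keep", "D": None},
    {"A": 5, "B": None, "C": None, "D": None},
]
filled = fill_matching_rows(rows, key="A", match=2, fill="x")
for row in filled:
    print(row)
```

Every input row appears in the output, which is the difference from filtering first: only the cells in matching rows are touched, and a non-null value in a matching row (here "keep") is left alone.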


2 Comments

Incredible how fast you answered! I am just adding that I had problems with data types. To solve that, I replaced 'x' with '0' and used the DataFrame schema to cast it, inside the coalesce, to whatever type the column has.

    df = df.select(
        "A",
        *[
            when(
                col("A") == '2',
                coalesce(col(c), lit('0').cast(df.schema[c].dataType))
            ).otherwise(col(c)).alias(c)
            for c in cols_to_replace
        ]
    )
