I want to replace null values in a DataFrame, but only on rows that match a specific criterion.
I have this DataFrame:
A|B   |C   |D   |
1|null|null|null|
2|null|null|null|
2|null|null|null|
2|null|null|null|
5|null|null|null|
I want to do this:
A|B   |C   |D   |
1|null|null|null|
2|x   |x   |x   |
2|x   |x   |x   |
2|x   |x   |x   |
5|null|null|null|
My case
So every row that has the value 2 in column A should have its null values replaced.
The columns A, B, C, D are dynamic: their number and names will change.
I also want to be able to select all the rows, not only the replaced ones.
What I tried
I tried with df.where and fillna, but it does not keep all the rows.
I also thought about using withColumn, but I only know column A; all the others will change on each execution.
Adapted Solution:
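from pyspark.sql.functions import coalesce, col, lit, when

# Only "A" is known in advance, so every other column is treated as replaceable.
cols_to_replace = [c for c in df.columns if c != "A"]

df.select(
    "A",
    *[
        when(col("A") == '2', coalesce(col(c), lit('0').cast(df.schema[c].dataType)))
        .otherwise(col(c))
        .alias(c)
        for c in cols_to_replace
    ],
)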
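To make the per-row logic of that expression easier to check, here is a minimal plain-Python sketch with no Spark involved. It assumes rows are dicts and that None stands in for null. The helper name fill_nulls_where is hypothetical, and it only models the behaviour, it is not how Spark evaluates the expression.

```python
def fill_nulls_where(rows, key, match, fill):
    """Return all rows; nulls are filled only in rows where row[key] == match."""
    out = []
    for row in rows:
        if row[key] == match:
            # Mirrors when(...) with coalesce(col(c), fill): keep existing
            # values, fill only the nulls, and leave the key column alone.
            new_row = {
                c: (fill if v is None and c != key else v)
                for c, v in row.items()
            }
        else:
            # Mirrors .otherwise(col(c)): the row passes through unchanged.
            new_row = dict(row)
        out.append(new_row)
    return out


# Column names come from the rows themselves, so nothing but "A" is hard-coded.
rows = [{"A": a, "B": None, "C": None, "D": None} for a in (1, 2, 2, 2, 5)]
result = fill_nulls_where(rows, key="A", match=2, fill="x")
for r in result:
    print(r)
```

Every input row appears in the output, which is the difference from filtering first and filling afterwards: a filter drops the non-matching rows, while a per-row condition only decides which rows get filled.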