As an example, say I have a DataFrame:
```python
from pyspark.sql import Row

row = Row("v", "x", "y", "z")
df = sc.parallelize([
    row("p", 1, 2, 3.0),
    row("NULL", 3, "NULL", 5.0),
    row("NA", None, 6, 7.0),
    row(float("Nan"), 8, "NULL", float("NaN"))
]).toDF()
```

Now I want to replace the values NULL, NA, and NaN with PySpark's null (None) value. How do I achieve this for multiple columns at once? I can do it one column at a time:
```python
from pyspark.sql.functions import when, lit, col

def replace(column, value):
    # keep the value unless it matches the sentinel; otherwise emit null
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("v", replace(col("v"), "NULL"))
df = df.withColumn("v", replace(col("v"), "NA"))
df = df.withColumn("v", replace(col("v"), "NaN"))
```

But writing this out for every column is what I am trying to avoid, since my DataFrame can have any number of columns.
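For reference, here is a minimal sketch of the kind of generalization I have in mind, folding the `replace` helper above over every column with `functools.reduce`. The `SENTINELS` list and the `replace_all` wrapper are my own hypothetical names, not from any library:

```python
from functools import reduce
from pyspark.sql.functions import when, lit, col

SENTINELS = ["NULL", "NA", "NaN"]  # assumed set of placeholder strings

def replace(column, value):
    # keep the value unless it matches the sentinel; otherwise emit null
    return when(column != value, column).otherwise(lit(None))

def replace_all(df, values=SENTINELS):
    # apply the replacement to every column, for every sentinel value
    for v in values:
        df = reduce(
            lambda acc, c: acc.withColumn(c, replace(col(c), v)),
            df.columns,
            df,
        )
    return df

df_clean = replace_all(df)
```

Is something along these lines the idiomatic way to do it, or is there a better built-in approach?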
Appreciate your help. Thanks!