
As an example, say I have a DataFrame:

from pyspark.sql import Row

row = Row("v", "x", "y", "z")
df = sc.parallelize([
    row("p", 1, 2, 3.0),
    row("NULL", 3, "NULL", 5.0),
    row("NA", None, 6, 7.0),
    row(float("NaN"), 8, "NULL", float("NaN"))
]).toDF()

Now I want to replace the strings "NULL", "NA", and "NaN" with PySpark's null (None) value. How do I achieve this for multiple columns together?

from pyspark.sql.functions import when, lit, col

def replace(column, value):
    # Keep the value when it does not match; otherwise emit null
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("v", replace(col("v"), "NULL"))
df = df.withColumn("v", replace(col("v"), "NA"))
df = df.withColumn("v", replace(col("v"), "NaN"))

Writing this out for every column is something I am trying to avoid, as my DataFrame can have any number of columns.

Appreciate your help. Thanks!

1 Answer


Loop through the columns, construct the column expressions that replace specific strings with null, then select the columns:

df.show()
+----+----+----+---+
|   v|   x|   y|  z|
+----+----+----+---+
|   p|   1|   2|3.0|
|NULL|   3|null|5.0|
|  NA|null|   6|7.0|
| NaN|   8|null|NaN|
+----+----+----+---+

import pyspark.sql.functions as F

cols = [F.when(~F.col(x).isin("NULL", "NA", "NaN"), F.col(x)).alias(x) for x in df.columns]

df.select(*cols).show()
+----+----+----+----+
|   v|   x|   y|   z|
+----+----+----+----+
|   p|   1|   2| 3.0|
|null|   3|null| 5.0|
|null|null|   6| 7.0|
|null|   8|null|null|
+----+----+----+----+
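To unpack the comprehension: ~F.col(x).isin("NULL", "NA", "NaN") is true for values that should be kept, F.when(condition, F.col(x)) returns the original value when the condition holds and, having no otherwise() clause, yields null for everything else, and .alias(x) preserves the original column name. A minimal sketch of the same logic as an explicit loop (the names keep and expr are illustrative):

import pyspark.sql.functions as F

cols = []
for name in df.columns:
    # True for values that are NOT one of the placeholder strings
    keep = ~F.col(name).isin("NULL", "NA", "NaN")
    # when() with no otherwise() produces null wherever `keep` is False
    expr = F.when(keep, F.col(name)).alias(name)
    cols.append(expr)

df.select(*cols).show()

Depending on your Spark version, df.na.replace(["NULL", "NA", "NaN"], None) may achieve the same result for string columns, but the select approach above works regardless of column type.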

1 Comment

Could there be an explanation of how cols = [F.when(~F.col(x).isin("NULL", "NA", "NaN"), F.col(x)).alias(x) for x in df.columns] works?
