I have a DataFrame (df) that contains a column user_id:
    df = sc.parallelize([
        (1, "not_set"),
        (2, "user_001"),
        (3, "user_002"),
        (4, "n/a"),
        (5, "N/A"),
        (6, "userid_not_set"),
        (7, "user_003"),
        (8, "user_004")
    ]).toDF(["key", "user_id"])

df:
    +---+--------------+
    |key|       user_id|
    +---+--------------+
    |  1|       not_set|
    |  2|      user_001|
    |  3|      user_002|
    |  4|           n/a|
    |  5|           N/A|
    |  6|userid_not_set|
    |  7|      user_003|
    |  8|      user_004|
    +---+--------------+

I would like to replace the following values with null: not_set, n/a, N/A, and userid_not_set.
Ideally, I could add any new values to a list and have them replaced as well.
I am currently using a CASE statement within spark.sql to perform this and would like to change it to PySpark.
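For reference, this is roughly the spark.sql CASE approach I have now, as a minimal sketch; the temp-view name "users" is just illustrative:

    from pyspark.sql import SparkSession

    # Assumes an existing Spark session (sc.parallelize above implies one)
    spark = SparkSession.builder.getOrCreate()

    df = spark.sparkContext.parallelize([
        (1, "not_set"), (2, "user_001"), (3, "user_002"), (4, "n/a"),
        (5, "N/A"), (6, "userid_not_set"), (7, "user_003"), (8, "user_004")
    ]).toDF(["key", "user_id"])

    # "users" is a hypothetical view name used only for this example
    df.createOrReplaceTempView("users")

    # Map the placeholder strings to NULL via a CASE expression
    cleaned = spark.sql("""
        SELECT key,
               CASE WHEN user_id IN ('not_set', 'n/a', 'N/A', 'userid_not_set')
                    THEN NULL
                    ELSE user_id
               END AS user_id
        FROM users
    """)
    cleaned.show()

The hard-coded IN list is the part I want to replace with a Python list in PySpark so new placeholder values can be appended easily.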