
I have a Dataframe:

ID      | program
--------|--------
53-8975 | null
53-9875 | null
53A7569 |
53-9456 | XXXX
53-9875 |

The ID and program columns are both strings. I want to replace every null or "" value in the program column with the letter K, but only if the 4th character of the ID column is 9. For example:

Two rows have 9 as the 4th character of ID and a missing program: both have ID 53-9875, and their program values are null and "" respectively.

How can I fill the program column with the letter K when the 4th character of ID is 9 and program is null or "", using PySpark?

The resulting DataFrame should be:

ID      | program
--------|--------
53-8975 | null
53-9875 | K
53A7569 |
53-9456 | XXXX
53-9875 | K

1 Answer


So if we have your original dataframe:

    df = spark.createDataFrame(
        [("53-8975", None), ("53-9875", None), ("53A7569", ""), ("53-9456", "XXXX"), ("53-9875", "")],
        ["id", "program"],
    )
    df.show()

    +-------+-------+
    |     id|program|
    +-------+-------+
    |53-8975|   null|
    |53-9875|   null|
    |53A7569|       |
    |53-9456|   XXXX|
    |53-9875|       |
    +-------+-------+

We can create a transformation that takes either the existing program value or "K", according to your specification, with when().otherwise():

    from pyspark.sql.functions import col, isnull, lit, when

    # True where program is an empty string or null
    programNullOrEmpty = (col("program") == "") | isnull(col("program"))
    # substr is 1-based: take 1 character starting at position 4 of id
    id9 = col("id").substr(4, 1) == "9"

    df.withColumn(
        "program",
        when(programNullOrEmpty & id9, lit("K")).otherwise(col("program")),
    ).show()

    +-------+-------+
    |     id|program|
    +-------+-------+
    |53-8975|   null|
    |53-9875|      K|
    |53A7569|       |
    |53-9456|   XXXX|
    |53-9875|      K|
    +-------+-------+
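To sanity-check the expected result without a Spark session, the same rule can be restated in plain Python. This is only a sketch: `fill_program` is a hypothetical helper name, not part of PySpark. Note that Python indexing is 0-based, so the 4th character is `id_[3]`, whereas the `substr(4, 1)` call in the answer is 1-based.

```python
def fill_program(id_, program):
    """Return "K" when program is None or "" and the 4th character of id_ is "9".

    Hypothetical helper for illustration only; otherwise program is returned unchanged.
    """
    missing = program is None or program == ""
    fourth_is_9 = len(id_) >= 4 and id_[3] == "9"  # index 3 == 4th character
    return "K" if missing and fourth_is_9 else program


# Same sample rows as in the question
rows = [
    ("53-8975", None),
    ("53-9875", None),
    ("53A7569", ""),
    ("53-9456", "XXXX"),
    ("53-9875", ""),
]

result = [(id_, fill_program(id_, program)) for id_, program in rows]
for row in result:
    print(row)
```

Only the two `53-9875` rows change here; `53-9456` keeps `XXXX` because its program is not missing.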

3 Comments

Thank you for your answer. I adapted your solution like this:

    output = (
        df.select(
            F.col('program'),
            F.col('ID')
            .withColumn("program", F.when((F.col("program") == "") | (isnull(F.col("program"))) & (F.col("ID").substr(4,1) == "9"), lit("K")).otherwise(F.col("program")))
        )
    )

I got this error: TypeError: 'Column' object is not callable. Could you help, please?
There were issues with the parentheses in your modifications. It should work like this:

    output = (
        df.select(
            F.col('program'),
            F.col('ID')
        ).withColumn(
            "program",
            F.when(
                ((F.col("program") == "") | (F.isnull(F.col("program"))))
                & (F.col("ID").substr(4, 1) == "9"),
                F.lit("K"),
            ).otherwise(F.col("program")),
        )
    )
If you found the answer useful please accept the answer and optionally upvote :)
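On the parenthesis fix discussed in the comments: as far as can be told from the code shown, two separate things go wrong in the modified snippet. First, `.withColumn` is chained onto `F.col('ID')` instead of onto the DataFrame returned by `select`. Attribute access on a `Column` yields another `Column`, so calling it raises `'Column' object is not callable`. Second, Python's `&` binds more tightly than `|`, so `a | b & c` is read as `a | (b & c)`. The sketch below illustrates only the second point, with plain booleans standing in for the column conditions (the variable names are made up for illustration):

```python
# Stand-ins for a single row such as ("53A7569", ""):
is_empty = True    # program == ""
is_null = False    # program is null
id_is_9 = False    # 4th character of ID is not "9"

# Parsed as is_empty | (is_null & id_is_9), because & binds tighter than |
without_parens = is_empty | is_null & id_is_9
# Explicit grouping: (empty or null) and 4th character is "9"
with_parens = (is_empty | is_null) & id_is_9

print(without_parens)  # True: this row would be filled with K by mistake
print(with_parens)     # False: this row is left unchanged
```

This is why the corrected version wraps the whole "empty or null" test in its own pair of parentheses before combining it with the ID check.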
