I have a dataset:

col_id  col_2  col_3  col_id_b
ABC111  shfhs  34775  null
ABC112  shfhe  34775  DEF345
ABC112  shfhs  34775  GFR563
ABC112  shfgh  34756  TRS572
ABC113  shfdh  34795  null
ABC114  shfhs  34770  null

I am trying to create a new column that is identical to col_id_b, except that the nulls take the value of the corresponding col_id from that row. So:

col_id  col_2  col_3  col_id_b  col_new
ABC111  shfhs  34775  null      ABC111
ABC112  shfhe  34775  DEF345    DEF345
ABC112  shfhs  34775  GFR563    GFR563
ABC112  shfgh  34756  TRS572    TRS572
ABC113  shfdh  34795  null      ABC113
ABC114  shfhs  34770  null      ABC114

I know about:

df.select(coalesce(df["col_id"], df["col_id_b"])).show() 

But in my case there are many rows where both columns are not null. How do I introduce this condition?

1 Answer

Just invert the order of the columns:

from pyspark.sql.functions import coalesce, col
df.select(coalesce(col('col_id_b'), col('col_id')))

coalesce returns, for each row, the value of the first column that is not null. So if you specify col_id_b first, you get col_id_b wherever it is not null, and col_id otherwise.
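To make the argument-order point concrete, here is a minimal plain-Python sketch of the row-wise "first non-null" rule, applied to the data from the question. It is only an illustration of the semantics, not the Spark API: `first_non_null` is a hypothetical helper, and `None` stands in for null.

```python
def first_non_null(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# (col_id, col_id_b) pairs from the question; None stands in for null.
rows = [
    ("ABC111", None),
    ("ABC112", "DEF345"),
    ("ABC112", "GFR563"),
    ("ABC112", "TRS572"),
    ("ABC113", None),
    ("ABC114", None),
]

# col_id_b is passed first, so col_id is only used as a fallback.
col_new = [first_non_null(col_id_b, col_id) for col_id, col_id_b in rows]

# Passing col_id first instead always returns col_id here,
# because col_id is never null in this data.
wrong_order = [first_non_null(col_id, col_id_b) for col_id, col_id_b in rows]

print(col_new)
print(wrong_order)
```

With col_id first, the fallback is never reached, which is why the order in the question's attempt did not give the desired column.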

4 Comments

Thank you! It says column name coalesce('col_id_b', 'col_id') contains invalid characters and I need an alias to rename it, but I don't see why.
Me neither honestly... What is the exact error stacktrace?
I solved it with: df = df.withColumn("new_col", coalesce(col("col_id_b"),col("col_id")))
OK, in my tests the col function was not needed, but of course my script settings and previous code differ from yours. I'll edit my answer with your change so that future users do not run into the same issue. Glad to have helped!
