Note: coalesce will not replace NaN values, only nulls:
>>> import pyspark.sql.functions as F
>>> cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
>>> cDf.show()
+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|null|
|null|   2|
+----+----+

>>> cDf.select(F.coalesce(cDf["a"], cDf["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|          null|
|             1|
|             2|
+--------------+
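As an aside, the final argument to coalesce can also be a literal fallback, so rows where every column is null still get a value. A small sketch along the lines of the PySpark docs, reusing cDf from above:

>>> cDf.select(F.coalesce(cDf["a"], F.lit(0.0))).show()
+----------------+
|coalesce(a, 0.0)|
+----------------+
|             0.0|
|             1.0|
|             0.0|
+----------------+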
Let's now create a pandas.DataFrame with None entries, convert it into a Spark DataFrame, and use coalesce again:
>>> import pandas as pd
>>> cDf_from_pd = spark.createDataFrame(pd.DataFrame({'a': [None, 1, None], 'b': [None, None, 2]}))
>>> cDf_from_pd.show()
+---+---+
|  a|  b|
+---+---+
|NaN|NaN|
|1.0|NaN|
|NaN|2.0|
+---+---+

>>> cDf_from_pd.select(F.coalesce(cDf_from_pd["a"], cDf_from_pd["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|           NaN|
|           1.0|
|           NaN|
+--------------+
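The reason is that pandas represents missing numeric data as NaN, not as null, so the conversion preserves that. You can confirm these entries are genuine NaN values rather than nulls with F.isnan and Column.isNull; a quick check added here, not part of the original session:

>>> cDf_from_pd.select(F.isnan("a"), cDf_from_pd["a"].isNull()).show()
+--------+-----------+
|isnan(a)|(a IS NULL)|
+--------+-----------+
|    true|      false|
|   false|      false|
|    true|      false|
+--------+-----------+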
In that case, you'll first need to call replace on your DataFrame to convert the NaNs to nulls, after which coalesce behaves as in the first example.
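A minimal sketch of that fix, reusing cDf_from_pd from above (Spark SQL treats NaN = NaN as true, so replace can match NaN directly):

>>> fixed = cDf_from_pd.replace(float('nan'), None)
>>> fixed.select(F.coalesce(fixed["a"], fixed["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|          null|
|           1.0|
|           2.0|
+--------------+

Alternatively, F.nanvl(col1, col2) returns the second column whenever the first is NaN, though unlike coalesce it does not also cover nulls.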