
I have a pyspark dataframe like this:

+--------------------+--------------------+
|                name|               value|
+--------------------+--------------------+
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                  id|                null|
|                name|                null|
|                 age|                null|
|                food|                null|
|                null|                   1|
|                null|                 Joe|
|                null|                  47|
|                null|               pizza|
+--------------------+--------------------+

I want to remove the null values from each individual column so the non-null data lines up.

The desired output is:

+--------------------+--------------------+
|                name|               value|
+--------------------+--------------------+
|                  id|                   1|
|                name|                 Joe|
|                 age|                  47|
|                food|               pizza|
+--------------------+--------------------+

I have tried removing nulls with something like df.dropna(how='any') or df.dropna(how='all'), and also by separating out the columns and removing the nulls from each, but then it becomes difficult to join them back together (see the sketch below).
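
Roughly, this is the kind of thing I tried (a minimal sketch of both attempts; the row-number join key is something I invented just to have a key to join on):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Attempt 1: dropna only removes whole rows, so the surviving rows stay misaligned.
    df.dropna(how='all').show()   # removes the all-null rows, but name/value still never share a row
    df.dropna(how='any').show()   # removes every row, since no row has both columns populated

    # Attempt 2: split the columns, drop nulls from each, then try to rejoin.
    # There is no shared key after the split, so a row number has to be invented, which feels fragile.
    w = Window.orderBy(F.lit(1))
    names = df.select("name").dropna().withColumn("rn", F.row_number().over(w))
    values = df.select("value").dropna().withColumn("rn", F.row_number().over(w))
    names.join(values, "rn").drop("rn").show()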

1 Answer


Try this. It is written in Scala, but can be ported to PySpark with minimal change:

    df.select(map_from_arrays(collect_list("name").as("name"), collect_list("value").as("value")).as("map"))
      .select(explode_outer($"map").as(Seq("name", "value")))
      .show(false)

    /**
      * +----+-----+
      * |name|value|
      * +----+-----+
      * |id  |1    |
      * |name|Joe  |
      * |age |47   |
      * |food|pizza|
      * +----+-----+
      */
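
For reference, here is a minimal runnable PySpark port of the same approach. The SparkSession setup and the inline sample frame are assumptions added to make the sketch self-contained; the transformation itself mirrors the Scala above.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Rebuild a small version of the input frame: names and values never share a row.
    data = ([(None, None)] * 10
            + [("id", None), ("name", None), ("age", None), ("food", None)]
            + [(None, "1"), (None, "Joe"), (None, "47"), (None, "pizza")])
    df = spark.createDataFrame(data, ["name", "value"])

    # collect_list skips nulls, so each column collapses to its non-null entries in order;
    # map_from_arrays zips the two lists into one map, and explode_outer turns it back into rows.
    result = (df
              .select(F.map_from_arrays(F.collect_list("name"),
                                        F.collect_list("value")).alias("map"))
              .select(F.explode_outer("map").alias("name", "value")))
    result.show()
    # +----+-----+
    # |name|value|
    # +----+-----+
    # |  id|    1|
    # |name|  Joe|
    # | age|   47|
    # |food|pizza|
    # +----+-----+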

1 Comment

The PySpark version of the same: (df.select(F.map_from_arrays(F.collect_list("name"), F.collect_list("value")).alias("map")).select(F.explode_outer("map").alias("name", "value"))).show(). Very nicely done, learned something new. +1
