
I have a pyspark dataframe like this:

+--------------------+--------------------+
|                name|               value|
+--------------------+--------------------+
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                null|                null|
|                  id|                null|
|                name|                null|
|                 age|                null|
|                food|                null|
|                null|                   1|
|                null|                 Joe|
|                null|                  47|
|                null|               pizza|
+--------------------+--------------------+

I want to remove the null values from each individual column so the non-null data lines up.

The desired output is:

+--------------------+--------------------+
|                name|               value|
+--------------------+--------------------+
|                  id|                   1|
|                name|                 Joe|
|                 age|                  47|
|                food|               pizza|
+--------------------+--------------------+

I have tried removing nulls with something like df.dropna(how='any') or df.dropna(how='all'), and also by separating out the columns and removing the nulls from each, but then it becomes difficult to join them back together (see the sketch below).
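
Roughly, this is the kind of thing I tried (a minimal sketch of both attempts; the row-number join key is something I invented just to have a key to join on):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Attempt 1: dropna only removes whole rows, so the surviving rows stay misaligned.
    df.dropna(how='all').show()   # removes the all-null rows, but name/value still never share a row
    df.dropna(how='any').show()   # removes every row, since no row has both columns populated

    # Attempt 2: split the columns, drop nulls from each, then try to rejoin.
    # There is no shared key after the split, so a row number has to be invented, which feels fragile.
    w = Window.orderBy(F.lit(1))
    names = df.select("name").dropna().withColumn("rn", F.row_number().over(w))
    values = df.select("value").dropna().withColumn("rn", F.row_number().over(w))
    names.join(values, "rn").drop("rn").show()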

1 Answer


Try this. It is written in Scala, but can be ported to PySpark with minimal change:

    df.select(map_from_arrays(collect_list("name").as("name"), collect_list("value").as("value")).as("map"))
      .select(explode_outer($"map").as(Seq("name", "value")))
      .show(false)

    /**
      * +----+-----+
      * |name|value|
      * +----+-----+
      * |id  |1    |
      * |name|Joe  |
      * |age |47   |
      * |food|pizza|
      * +----+-----+
      */
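
For reference, here is a minimal runnable PySpark port of the same approach. The SparkSession setup and the inline sample frame are assumptions added to make the sketch self-contained; the transformation itself mirrors the Scala above.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Rebuild a small version of the input frame: names and values never share a row.
    data = ([(None, None)] * 10
            + [("id", None), ("name", None), ("age", None), ("food", None)]
            + [(None, "1"), (None, "Joe"), (None, "47"), (None, "pizza")])
    df = spark.createDataFrame(data, ["name", "value"])

    # collect_list skips nulls, so each column collapses to its non-null entries in order;
    # map_from_arrays zips the two lists into one map, and explode_outer turns it back into rows.
    result = (df
              .select(F.map_from_arrays(F.collect_list("name"),
                                        F.collect_list("value")).alias("map"))
              .select(F.explode_outer("map").alias("name", "value")))
    result.show()
    # +----+-----+
    # |name|value|
    # +----+-----+
    # |  id|    1|
    # |name|  Joe|
    # | age|   47|
    # |food|pizza|
    # +----+-----+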

1 Comment

The PySpark version of the same: (df.select(F.map_from_arrays(F.collect_list("name"), F.collect_list("value")).alias("map")).select(F.explode_outer("map").alias("name", "value"))).show(). Very nicely done, learned something new. +1
