PySpark GroupBy - Keep Value or Null if No Value

Question

I'm coding in PySpark and have a DataFrame of tokens and their associated phrases. The same phrase can appear in multiple rows, so I want to group the data so that each phrase has only one row, keeping the row that has an associated descriptor. If no row has a descriptor, I want to keep one row with the null. Example data set:

+------------------------------------+--------+-------+---------+------------+-----------+
|                            SENTENCE| SENT_ID|  TOKEN| TOKEN_ID|      PHRASE| DESCRIPTOR|
+------------------------------------+--------+-------+---------+------------+-----------+
|The handle of the old razor blade...|       1| handle|        2|      handle|       null|
|The handle of the old razor blade...|       1|  razor|        6| razor blade|       null|
|The handle of the old razor blade...|       1|  blade|        7| razor blade|        old|
+------------------------------------+--------+-------+---------+------------+-----------+

I want it to look like:

+------------------------------------+--------+------------+-----------+
|                            SENTENCE| SENT_ID|      PHRASE| DESCRIPTOR|
+------------------------------------+--------+------------+-----------+
|The handle of the old razor blade...|       1|      handle|       null|
|The handle of the old razor blade...|       1| razor blade|        old|
+------------------------------------+--------+------------+-----------+

There will never be a situation where there are different descriptors for the same phrase. I'm thinking something like df.groupby('REVIEW_ID','SENT_ID','PHRASE') but not sure how to bring in the descriptor.

Show what you have tried. What are you expecting to happen if there are 2 non-null values?
– AChampion, Commented Aug 11, 2020 at 13:21
Not sure what to try. There is never a situation where there are 2 non-nulls.
– user3242036, Commented Aug 11, 2020 at 13:35

notNull · Accepted Answer · 2020-08-11 14:37:53Z

Use the collect_list or collect_set functions to gather the descriptor values.

collect_list and collect_set do not preserve null values. To keep them, use when/otherwise to replace each null with the string 'null' before collecting.

Example:

from pyspark.sql.functions import col, collect_list, collect_set, concat_ws, isnull, lit, when

df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|  1|   a|   100|
#|  1|null|   200|
#|  1|null|   300|
#+---+----+------+

#grouping by id and collecting names (nulls are dropped)
df.groupBy("id").agg(collect_list(col("name")).alias("list")).show()
#+---+----+
#| id|list|
#+---+----+
#|  1| [a]|
#+---+----+

#preserve nulls with duplicates
df.groupBy("id").\
    agg(concat_ws(",",collect_list(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
    show()
#+---+-----------+
#| id|       list|
#+---+-----------+
#|  1|a,null,null|
#+---+-----------+

#preserve nulls without duplicates
df.groupBy("id").\
    agg(concat_ws(",",collect_set(when(isnull(col("name")),lit('null')).otherwise(col("name")))).alias("list")).\
    show()
#+---+------+
#| id|  list|
#+---+------+
#|  1|a,null|
#+---+------+

So in this situation I would want the first one, where id 1 only returns a for list. But if there were an id 2 and it only had null values, then I would want to return just one null value.
Use collect_set after groupBy instead of collect_list. Check my answer; I added the full syntax in the #preserve nulls without duplicates section.
But in your example I would not want the null to be preserved for id 1, because it has a. I would only want the null if there were no other non-null values.
Used your first example; it looks to work for nulls by just returning an empty list. Thanks.
