I'm coding in PySpark and have a DataFrame of tokens and their associated phrases. The same phrase can appear in multiple rows, so I want to group the rows so that each phrase appears only once, keeping the row that has an associated descriptor. If there is no descriptor, I want to keep one row with the null. Example data set:

    +------------------------------------+--------+-------+---------+------------+-----------+
    |                            SENTENCE| SENT_ID|  TOKEN| TOKEN_ID|      PHRASE| DESCRIPTOR|
    +------------------------------------+--------+-------+---------+------------+-----------+
    |The handle of the old razor blade...|       1| handle|        2|      handle|       null|
    |The handle of the old razor blade...|       1|  razor|        6| razor blade|       null|
    |The handle of the old razor blade...|       1|  blade|        7| razor blade|        old|
    +------------------------------------+--------+-------+---------+------------+-----------+

I want it to look like:

    +------------------------------------+--------+------------+-----------+
    |                            SENTENCE| SENT_ID|      PHRASE| DESCRIPTOR|
    +------------------------------------+--------+------------+-----------+
    |The handle of the old razor blade...|       1|      handle|       null|
    |The handle of the old razor blade...|       1| razor blade|        old|
    +------------------------------------+--------+------------+-----------+

There will never be a situation where there are different descriptors for the same phrase. I'm thinking of something like df.groupby('REVIEW_ID','SENT_ID','PHRASE'), but I'm not sure how to bring in the descriptor.

  • Show what you have tried. What are you expecting to happen if there are 2 non-null values? Commented Aug 11, 2020 at 13:21
  • Not sure what to try. There is never a situation where there are 2 non-nulls. Commented Aug 11, 2020 at 13:35

1 Answer

Use collect_list or collect_set functions to get descriptor values.

  • collect_list and collect_set don't preserve null values. If you need to keep them, use when/otherwise to replace each null with the string 'null' before collecting.

Example:

    from pyspark.sql.functions import col, collect_list, collect_set, concat_ws, isnull, lit, when

    df.show()
    #+---+----+------+
    #| id|name|salary|
    #+---+----+------+
    #|  1|   a|   100|
    #|  1|null|   200|
    #|  1|null|   300|
    #+---+----+------+

    # grouping by id and collecting names (nulls are dropped)
    df.groupBy("id").agg(collect_list(col("name")).alias("list")).show()
    #+---+----+
    #| id|list|
    #+---+----+
    #|  1| [a]|
    #+---+----+

    # preserve nulls, with duplicates
    df.groupBy("id").\
        agg(concat_ws(",", collect_list(when(isnull(col("name")), lit('null')).otherwise(col("name")))).alias("list")).\
        show()
    #+---+-----------+
    #| id|       list|
    #+---+-----------+
    #|  1|a,null,null|
    #+---+-----------+

    # preserve nulls, without duplicates
    df.groupBy("id").\
        agg(concat_ws(",", collect_set(when(isnull(col("name")), lit('null')).otherwise(col("name")))).alias("list")).\
        show()
    #+---+------+
    #| id|  list|
    #+---+------+
    #|  1|a,null|
    #+---+------+
4 Comments

So in this situation I would want the first one, where id 1 only returns a for list. But if there were an id 2 and it only had null values, then I would want to return just 1 null value.
Use collect_set after groupBy instead of collect_list. Check my answer; I added the full syntax in the "preserve nulls without duplicates" section.
But in your example I would not want the null to be preserved for id 1, because it has a. I would only want the null if there were no other non-null values there.
I used your first example; it looks like it works for nulls by just returning an empty list. Thanks.