I'm coding in PySpark and have a DataFrame containing tokens and their associated phrases. The same phrase can appear in multiple rows, so I want to group the data so that each phrase has only one row, keeping the row that has an associated descriptor. If a phrase has no descriptor, I want to keep one row with the null. Example data set:
+------------------------------------+--------+-------+---------+------------+-----------+
|                            SENTENCE| SENT_ID|  TOKEN| TOKEN_ID|      PHRASE| DESCRIPTOR|
+------------------------------------+--------+-------+---------+------------+-----------+
|The handle of the old razor blade...|       1| handle|        2|      handle|       null|
|The handle of the old razor blade...|       1|  razor|        6| razor blade|       null|
|The handle of the old razor blade...|       1|  blade|        7| razor blade|        old|
+------------------------------------+--------+-------+---------+------------+-----------+

I want it to look like:
+------------------------------------+--------+------------+-----------+
|                            SENTENCE| SENT_ID|      PHRASE| DESCRIPTOR|
+------------------------------------+--------+------------+-----------+
|The handle of the old razor blade...|       1|      handle|       null|
|The handle of the old razor blade...|       1| razor blade|        old|
+------------------------------------+--------+------------+-----------+

There will never be a situation where there are different descriptors for the same phrase. I'm thinking of something like df.groupby('REVIEW_ID','SENT_ID','PHRASE'), but I'm not sure how to bring in the descriptor.