Suppose I have the following data set:
a | b
1 | 0.4
1 | 0.8
1 | 0.5
2 | 0.4
2 | 0.1

I would like to add a new column called "label" where the values are determined locally for each group of values in a. The highest value of b in each group of a is labeled 1 and all others are labeled 0.
The output would look like this:

a | b | label
1 | 0.4 | 0
1 | 0.8 | 1
1 | 0.5 | 0
2 | 0.4 | 1
2 | 0.1 | 0

How can I do this efficiently using PySpark?
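
To make the rule concrete, here is a minimal plain-Python sketch of the labeling I have in mind (not PySpark, and not meant to be efficient). The function name `add_label` and the tuple layout are just for illustration, and I am assuming that if several rows tie for the maximum within a group, all of them get label 1.

```python
# Sample rows as (a, b) tuples, matching the data set above.
rows = [(1, 0.4), (1, 0.8), (1, 0.5), (2, 0.4), (2, 0.1)]


def add_label(rows):
    """Return (a, b, label) tuples, where label is 1 for the largest b in each group a."""
    # First pass: find the maximum b for each value of a.
    max_b = {}
    for a, b in rows:
        if a not in max_b or b > max_b[a]:
            max_b[a] = b
    # Second pass: label 1 where b equals its group's maximum, else 0.
    # Assumption: ties for the maximum all receive label 1.
    return [(a, b, 1 if b == max_b[a] else 0) for a, b in rows]


labeled = add_label(rows)
for row in labeled:
    print(row)
```

I am looking for the equivalent of this on a Spark DataFrame, ideally without collecting the data to the driver.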