
Suppose I have the following data set:

a | b
1 | 0.4
1 | 0.8
1 | 0.5
2 | 0.4
2 | 0.1

I would like to add a new column called "label" whose values are determined locally within each group of values in a. The row with the highest value of b in each group of a is labeled 1, and all other rows in that group are labeled 0.

The output would look like this:

a | b | label
1 | 0.4 | 0
1 | 0.8 | 1
1 | 0.5 | 0
2 | 0.4 | 1
2 | 0.1 | 0

How can I do this efficiently using PySpark?

1 Answer


You can do it with window functions. First you'll need a couple of imports:

from pyspark.sql.functions import desc, row_number, when
from pyspark.sql.window import Window

and a window definition:

w = Window().partitionBy("a").orderBy(desc("b")) 

Finally, you combine them:

df.withColumn("label", when(row_number().over(w) == 1, 1).otherwise(0)) 

For example data:

df = sc.parallelize([
    (1, 0.4), (1, 0.8), (1, 0.5), (2, 0.4), (2, 0.1)
]).toDF(["a", "b"])

the result is:

+---+---+-----+
|  a|  b|label|
+---+---+-----+
|  1|0.8|    1|
|  1|0.5|    0|
|  1|0.4|    0|
|  2|0.4|    1|
|  2|0.1|    0|
+---+---+-----+

1 Comment

So what order should your code lines be run in -- as follows? from..., parallelize, window(), withColumn?
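To address the ordering question: imports first, then build the DataFrame, then the window definition, then withColumn. Below is a minimal end-to-end sketch in that order, assuming a modern Spark setup where you create a SparkSession explicitly (the answer's sc.parallelize(...).toDF(...) relies on a pre-existing SparkContext); the app name "label-max-per-group" is just a placeholder.

# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc, row_number, when
from pyspark.sql.window import Window

# Start a SparkSession (name is arbitrary)
spark = SparkSession.builder.appName("label-max-per-group").getOrCreate()

# Example data from the question
df = spark.createDataFrame(
    [(1, 0.4), (1, 0.8), (1, 0.5), (2, 0.4), (2, 0.1)],
    ["a", "b"],
)

# Partition by "a" and order each partition by "b" descending,
# so the highest b in each group gets row_number() == 1
w = Window.partitionBy("a").orderBy(desc("b"))

# Label the top row of each partition 1, everything else 0
result = df.withColumn("label", when(row_number().over(w) == 1, 1).otherwise(0))
result.show()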
