
The question title might be too vague, so here is a concrete example. Say we have a Spark DataFrame:

user_ID   phone_number
----------------------
A         1234567
B         1234567
C         8888888
D         9999999
E         1234567
F         8888888
G         1234567

And we need to count, for each user_ID, how many user_IDs share the same phone_number with it. For the table above, the desired result is:

user_ID   count_of_userID_who_share_the_same_phone_number
----------------------------------------------------------
A         4
B         4
C         2
D         1
E         4
F         2
G         4

It can be achieved by writing a self-join query in spark.sql(query), but the performance is quite heart-breaking. Any suggestions on how to get a much faster implementation? Thanks :)
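For reference, a self-join of this kind looks roughly like the following sketch (the actual query isn't shown above, and the "users" view name is hypothetical):

// Hypothetical self-join baseline: every row is matched against every
// row with the same phone_number, then matches are counted per user.
df.createOrReplaceTempView("users")
val result = spark.sql("""
  SELECT a.user_ID,
         COUNT(b.user_ID) AS count_of_userID_who_share_the_same_phone_number
  FROM users a
  JOIN users b
    ON a.phone_number = b.phone_number
  GROUP BY a.user_ID
""")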

1 Answer


Using Spark's window functions should perform significantly better than a self-join:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._  // for toDF; already in scope in spark-shell

val df = Seq(
  ("A", "1234567"),
  ("B", "1234567"),
  ("C", "8888888"),
  ("D", "9999999"),
  ("E", "1234567"),
  ("F", "8888888"),
  ("G", "1234567")
).toDF("user_id", "phone_number")

// Count user_ids within each phone_number partition via a window function
val df2 = df.withColumn("count",
  count("user_id").over(Window.partitionBy("phone_number"))
).orderBy("user_id")

df2.show

+-------+------------+-----+
|user_id|phone_number|count|
+-------+------------+-----+
|      A|     1234567|    4|
|      B|     1234567|    4|
|      C|     8888888|    2|
|      D|     9999999|    1|
|      E|     1234567|    4|
|      F|     8888888|    2|
|      G|     1234567|    4|
+-------+------------+-----+
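If the window version still shuffles too much on your data (for example, a few very hot phone numbers), an aggregate-then-join variant may be worth trying. This is a sketch of an alternative, not part of the answer above, and the broadcast hint assumes the set of distinct phone numbers is small enough to fit in executor memory:

// Alternative sketch: pre-aggregate counts per phone_number, then join
// the (much smaller) count table back onto the original rows.
import org.apache.spark.sql.functions.{broadcast, count}

val counts = df.groupBy("phone_number")
  .agg(count("user_id").as("count"))

// broadcast() is only a hint; appropriate when the distinct
// phone_number table fits comfortably on each executor.
val df3 = df.join(broadcast(counts), Seq("phone_number"))
  .orderBy("user_id")

The groupBy performs map-side partial aggregation, so only the small per-number counts are shuffled, and the broadcast join avoids shuffling the full table a second time.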

2 Comments

Just tried, much much faster than simple spark.sql queries! thx
Glad that it helps.
