
The question title might be too vague, so here is a concrete example. Say we have a Spark DataFrame:

user_ID   phone_number
----------------------
A         1234567
B         1234567
C         8888888
D         9999999
E         1234567
F         8888888
G         1234567

And we need to count, for each user_ID, how many user_IDs share the same phone_number with it. For the table above, the desired result is:

user_ID   count_of_userID_who_share_the_same_phone_number
----------------------------------------------------------
A         4
B         4
C         2
D         1
E         4
F         2
G         4

It can be achieved by writing a self-join query in spark.sql(query), but the performance is quite heart-breaking. Any suggestions on how to get a much faster implementation? Thanks :)
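For reference, a self-join of this kind looks roughly like the following sketch (the actual query isn't shown above, and the "users" view name is hypothetical):

// Hypothetical self-join baseline: every row is matched against every
// row with the same phone_number, then matches are counted per user.
df.createOrReplaceTempView("users")
val result = spark.sql("""
  SELECT a.user_ID,
         COUNT(b.user_ID) AS count_of_userID_who_share_the_same_phone_number
  FROM users a
  JOIN users b
    ON a.phone_number = b.phone_number
  GROUP BY a.user_ID
""")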

1 Answer


Using Spark's window functions should perform significantly better than a self-join:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._  // for toDF; already in scope in spark-shell

val df = Seq(
  ("A", "1234567"),
  ("B", "1234567"),
  ("C", "8888888"),
  ("D", "9999999"),
  ("E", "1234567"),
  ("F", "8888888"),
  ("G", "1234567")
).toDF("user_id", "phone_number")

// Count user_ids within each phone_number partition via a window function
val df2 = df.withColumn("count",
  count("user_id").over(Window.partitionBy("phone_number"))
).orderBy("user_id")

df2.show

+-------+------------+-----+
|user_id|phone_number|count|
+-------+------------+-----+
|      A|     1234567|    4|
|      B|     1234567|    4|
|      C|     8888888|    2|
|      D|     9999999|    1|
|      E|     1234567|    4|
|      F|     8888888|    2|
|      G|     1234567|    4|
+-------+------------+-----+
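If the window version still shuffles too much on your data (for example, a few very hot phone numbers), an aggregate-then-join variant may be worth trying. This is a sketch of an alternative, not part of the answer above, and the broadcast hint assumes the set of distinct phone numbers is small enough to fit in executor memory:

// Alternative sketch: pre-aggregate counts per phone_number, then join
// the (much smaller) count table back onto the original rows.
import org.apache.spark.sql.functions.{broadcast, count}

val counts = df.groupBy("phone_number")
  .agg(count("user_id").as("count"))

// broadcast() is only a hint; appropriate when the distinct
// phone_number table fits comfortably on each executor.
val df3 = df.join(broadcast(counts), Seq("phone_number"))
  .orderBy("user_id")

The groupBy performs map-side partial aggregation, so only the small per-number counts are shuffled, and the broadcast join avoids shuffling the full table a second time.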

2 Comments

Just tried, much much faster than simple spark.sql queries! thx
Glad that it helps.
