The question title might not be clear enough, so here is a concrete example. Let's say we have a Spark DataFrame:
user_ID    phone_number
--------------------------------
A          1234567
B          1234567
C          8888888
D          9999999
E          1234567
F          8888888
G          1234567

We need to count, for each user_ID, how many user_IDs share the same phone_number with it (the user itself included).
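In case it helps to reproduce, here is a minimal PySpark sketch that builds this frame (it assumes an existing SparkSession named `spark`; the `users` view name is only for illustration):

```python
# Minimal reproducible setup; assumes an existing SparkSession named `spark`.
data = [
    ("A", "1234567"), ("B", "1234567"), ("C", "8888888"),
    ("D", "9999999"), ("E", "1234567"), ("F", "8888888"),
    ("G", "1234567"),
]
df = spark.createDataFrame(data, ["user_ID", "phone_number"])
df.createOrReplaceTempView("users")  # view name is illustrative, used by spark.sql below
```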
For the table above, the desired result is:

user_ID    count_of_userID_who_share_the_same_phone_number
----------------------------------------------------------------
A          4
B          4
C          2
D          1
E          4
F          2
G          4

This can be achieved with a self-join query run through spark.sql(query) (sketched below), but the performance is heartbreaking. Any suggestions for a much faster implementation? Thanks :)
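For reference, the self-join version looks roughly like the following sketch (paraphrased; `users` is the temp view from the setup above):

```python
# Self-join counting, per user, how many rows share the same phone_number.
# This is the slow approach I would like to replace.
result = spark.sql("""
    SELECT a.user_ID,
           COUNT(*) AS count_of_userID_who_share_the_same_phone_number
    FROM users a
    JOIN users b
      ON a.phone_number = b.phone_number
    GROUP BY a.user_ID
""")
result.show()
```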