
I created a bucketed table in Spark using the command below:

df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test") 

Size of the table: 50 GB

Then I joined another table (say t2, size 70 GB, bucketed the same way) with the table above on the UserID column. In the execution plan I found that topn_bucket_test was being sorted (but not shuffled) before the join. Since it was bucketed and sorted on UserID, I expected it to be neither shuffled nor sorted before the join. What can be the reason, and how can I remove the sort phase for topn_bucket_test?

  • I think the sort step is applied by default even though your tables are bucketed by UserID. bucketBy only ensures that your data is not shuffled again during the join, because the keys are already co-located. Commented Aug 30, 2020 at 14:25
  • @kavetiraviteja Is there any way I can remove the sort too, as the datasets are already sorted? Commented Aug 30, 2020 at 16:13

1 Answer


As far as I can tell, it is not possible to avoid the sort phase here. Even when both tables are written with the same bucketBy call, the physical layout of the buckets is unlikely to be identical. Imagine the first table having UserIDs ranging from 1 to 1000 and the second from 1 to 2000: many different UserIDs hash into each of the 200 buckets, and within a bucket there may be multiple different (and, across bucket files, unsorted) UserIDs, so Spark still has to sort each side before the merge join.

