
I created a bucketed table in Spark using the command below:

df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test") 

Size of the table: 50 GB

Then I joined another table (say t2, size 70 GB, bucketed the same way) with the table above on the UserID column. In the execution plan I found that topn_bucket_test was being sorted (but not shuffled) before the join. Since it was bucketed and sorted on UserID, I expected it to be neither shuffled nor sorted before the join. What can be the reason, and how can I remove the sort phase for topn_bucket_test?

  • I think the sort step is applied by default even though your tables are bucketed by UserID. bucketBy only ensures that your data is not shuffled again during the join, because the keys are already co-located. Commented Aug 30, 2020 at 14:25
  • @kavetiraviteja Is there any way I can remove the sort too, as the datasets are already sorted? Commented Aug 30, 2020 at 16:13

1 Answer


As far as I can tell, it is not possible to avoid the sort phase here. Even when both tables are written with the same bucketBy call, the physical layout of the buckets is unlikely to be identical. Imagine the first table having UserIDs ranging from 1 to 1000 and the second from 1 to 2000: many different UserIDs hash into each of the 200 buckets, and within a bucket there may be multiple different (and, across bucket files, unsorted) UserIDs, so Spark still has to sort each side before the merge join.

