
I have read in multiple forums that the shuffle is reduced when doing a sort-merge join if the underlying tables are bucketed and sorted. However, my question is the following.

A sorted bucket only guarantees that the data in a bucket covers the same set of keys and that the data is sorted. Assume we have two DataFrames, d1 and d2, both sorted and bucketed.

  1. Does Spark guarantee that bucket x of table d1, containing the data for key1 and key2, is on the same machine as bucket y of table d2, containing key1 and key2?

If bucket x and bucket y are guaranteed to be on the same machine, then there will be no Exchange across nodes while doing the sort-merge join; if they can sit on different machines, then there should be a data exchange while doing the join.
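For concreteness, here is a minimal sketch of the setup being asked about (Scala / Spark SQL; the table names, the join column key, and the bucket count of 8 are all illustrative):

```scala
// Illustrative setup: two DataFrames written out as bucketed, sorted tables
// on the assumed join column "key". Bucketing only takes effect together
// with saveAsTable, and both sides use the same bucket count.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join-question").getOrCreate()
import spark.implicits._

val d1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "v1")
val d2 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("key", "v2")

d1.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("d1_bucketed")
d2.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("d2_bucketed")
```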

Please help me understand this concept. Thanks in advance.

1 Answer


Your understanding is correct. SortMergeJoin requires RangePartitioning of data.

If your DataFrames df1 and df2 are already partitioned by a RangePartitioner on key k (which is also used in the join), then there would be no extra exchange; otherwise there will be.
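One way to see whether an exchange is added is to join the bucketed tables back and inspect the physical plan. A minimal sketch, assuming the illustrative d1_bucketed and d2_bucketed tables from the question, matching bucket counts on both sides, and spark.sql.sources.bucketing.enabled left at its default of true:

```scala
// Sketch: join the two bucketed tables and check the printed plan for
// Exchange nodes. With matching bucket counts on the join key, no Exchange
// should appear above the bucketed scans; joining the raw, unbucketed
// DataFrames would show an Exchange on both sides instead. A Sort node can
// still show up, e.g. when a bucket is spread over several files.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val joined = spark.table("d1_bucketed").join(spark.table("d2_bucketed"), "key")
joined.explain()  // look for (the absence of) Exchange in the physical plan
```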


3 Comments

How does this make sure that the buckets to be joined are on the same machine? If the DataFrames are partitioned by range partitions on key k, this will not guarantee that both partitions are on the same machine; it only guarantees that the buckets contain the same keys.
The same range bucket goes to the same partition (machine).
Are you sure about that? I went through this link and it looked like hashpartitioning is used for sort-merge join: github.com/apache/spark/pull/21156/files/…
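One way to check which distribution the planner actually asks for, rather than relying on either claim, is to explain a plain unbucketed equi-join and read the Exchange node that Spark inserts. A minimal sketch (the column name k is illustrative; broadcast joins are disabled so that a SortMergeJoin is planned for the tiny test data):

```scala
// Sketch: explain an ordinary equi-join of two unbucketed DataFrames and
// read the Exchange node in the printed plan. With defaults, the Exchange
// planned under a SortMergeJoin is a hashpartitioning on the join key
// (something like "Exchange hashpartitioning(k#0L, 200)").
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Disable broadcast joins so the small test data still goes through SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val a = spark.range(0, 1000).withColumnRenamed("id", "k")
val b = spark.range(0, 1000).withColumnRenamed("id", "k")

a.join(b, "k").explain()
```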
