
I have read in multiple forums that the shuffle is reduced when doing a sort-merge join if the underlying tables are bucketed and sorted. However, my question is the following.

A sorted bucket only guarantees that the data in a bucket covers the same set of keys and that the data is sorted. Assume we have two DataFrames, d1 and d2, both sorted and bucketed.

  1. Does Spark guarantee that bucket x of table d1, containing the data for key1 and key2, is on the same machine as bucket y of table d2, containing key1 and key2?

If bucket x and bucket y are guaranteed to be on the same machine, then there will be no Exchange across nodes while doing the sort-merge join; if they can sit on different machines, then there should be a data exchange while doing the join.
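For concreteness, here is a minimal sketch of the setup being asked about (Scala / Spark SQL; the table names, the join column key, and the bucket count of 8 are all illustrative):

```scala
// Illustrative setup: two DataFrames written out as bucketed, sorted tables
// on the assumed join column "key". Bucketing only takes effect together
// with saveAsTable, and both sides use the same bucket count.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join-question").getOrCreate()
import spark.implicits._

val d1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "v1")
val d2 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("key", "v2")

d1.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("d1_bucketed")
d2.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("d2_bucketed")
```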

Please help me understand this concept. Thanks in advance.

1 Answer


Your understanding is correct. SortMergeJoin requires RangePartitioning of data.

If your DataFrames df1 and df2 are already partitioned by a RangePartitioner on key k (which is also used in the join), then there would be no extra exchange; otherwise there will be.
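One way to see whether an exchange is added is to join the bucketed tables back and inspect the physical plan. A minimal sketch, assuming the illustrative d1_bucketed and d2_bucketed tables from the question, matching bucket counts on both sides, and spark.sql.sources.bucketing.enabled left at its default of true:

```scala
// Sketch: join the two bucketed tables and check the printed plan for
// Exchange nodes. With matching bucket counts on the join key, no Exchange
// should appear above the bucketed scans; joining the raw, unbucketed
// DataFrames would show an Exchange on both sides instead. A Sort node can
// still show up, e.g. when a bucket is spread over several files.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val joined = spark.table("d1_bucketed").join(spark.table("d2_bucketed"), "key")
joined.explain()  // look for (the absence of) Exchange in the physical plan
```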


3 Comments

How does this make sure that the buckets to be joined are on the same machine? If the DataFrames are partitioned by range partitions on key k, this will not guarantee that both partitions are on the same machine; it only guarantees that the buckets contain the same keys.
The same range bucket goes to the same partition (machine).
Are you sure about that? I went through this link and it looked like hashpartitioning is used for sort-merge join: github.com/apache/spark/pull/21156/files/…
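One way to check which distribution the planner actually asks for, rather than relying on either claim, is to explain a plain unbucketed equi-join and read the Exchange node that Spark inserts. A minimal sketch (the column name k is illustrative; broadcast joins are disabled so that a SortMergeJoin is planned for the tiny test data):

```scala
// Sketch: explain an ordinary equi-join of two unbucketed DataFrames and
// read the Exchange node in the printed plan. With defaults, the Exchange
// planned under a SortMergeJoin is a hashpartitioning on the join key
// (something like "Exchange hashpartitioning(k#0L, 200)").
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Disable broadcast joins so the small test data still goes through SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val a = spark.range(0, 1000).withColumnRenamed("id", "k")
val b = spark.range(0, 1000).withColumnRenamed("id", "k")

a.join(b, "k").explain()
```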
