
I am trying to create a view on top of two tables.

Table 1: partitioned by col1, bucketed by col2 (number of buckets: 3600)

Table 2: partitioned by col1, bucketed by col2 (number of buckets: 3600)

View: Table1 JOIN Table2 ON Table1.col1 = Table2.col1 AND Table1.col2 = Table2.col2
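To make the setup concrete, here is a minimal sketch of what the DDL could look like in Spark SQL, built as plain Python strings. The table names `table1` and `table2`, the view name `joined_view`, the `payload` column, the column types and the Parquet format are all assumptions for illustration; my real tables may differ.

```python
NUM_BUCKETS = 3600  # bucket count from the question


def bucketed_table_ddl(name: str) -> str:
    """Build DDL for a table partitioned by col1 and bucketed by col2.

    Column types, the `payload` column and the Parquet format are
    hypothetical; only the partition/bucket layout comes from the question.
    """
    return (
        f"CREATE TABLE {name} (col2 BIGINT, payload STRING, col1 STRING) "
        "USING PARQUET "
        "PARTITIONED BY (col1) "
        f"CLUSTERED BY (col2) INTO {NUM_BUCKETS} BUCKETS"
    )


# View joining the two tables on both the partition and the bucket column.
VIEW_DDL = (
    "CREATE VIEW joined_view AS "
    "SELECT t1.col1, t1.col2, t1.payload AS payload1, t2.payload AS payload2 "
    "FROM table1 t1 JOIN table2 t2 "
    "ON t1.col1 = t2.col1 AND t1.col2 = t2.col2"
)

statements = [bucketed_table_ddl("table1"), bucketed_table_ddl("table2"), VIEW_DDL]
for stmt in statements:
    print(stmt)
```

With a live `SparkSession` named `spark`, each statement could be passed to `spark.sql(...)`, and `spark.table("joined_view").explain()` would print the physical plan, including any Exchange (shuffle) nodes.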

When I run a query on top of this view, the data for both tables is shuffled by hashpartitioning(col1, col2).

I don't understand why this happens. My understanding is that, since both tables are already partitioned and bucketed the same way with the same number of buckets, Spark should know where the data lives and should be able to do a sort-merge join without shuffling.
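The reasoning above can be illustrated with a small, Spark-free simulation. It is only a sketch: the hash function here is a stand-in (CRC32), not the Murmur3 hash Spark actually uses, and the sample rows are made up. It shows two things: rows with the same col2 land in the same bucket number in both tables, yet a layout hashed on col2 alone is not the same layout as one hashed on (col1, col2), which is the distribution the plan reports.

```python
import zlib

NUM_BUCKETS = 3600  # bucket count from the question


def bucket_id(*key_parts, num_buckets=NUM_BUCKETS):
    """Stand-in bucket hash. NOT Spark's Murmur3, just a deterministic hash."""
    raw = "\x1f".join(str(p) for p in key_parts).encode("utf-8")
    return zlib.crc32(raw) % num_buckets


# Hypothetical sample rows as (col1, col2) pairs.
rows = [("2023-12-01", k) for k in range(1000)]

# On-disk layout: both tables are bucketed by col2 only.
table1_layout = {row: bucket_id(row[1]) for row in rows}
table2_layout = {row: bucket_id(row[1]) for row in rows}
colocated = all(table1_layout[r] == table2_layout[r] for r in rows)

# Distribution shown in the plan: hash over (col1, col2).
join_layout = {row: bucket_id(row[0], row[1]) for row in rows}
moved = sum(1 for r in rows if table1_layout[r] != join_layout[r])

print(f"matching rows share a col2 bucket in both tables: {colocated}")
print(f"rows placed differently under hash(col1, col2): {moved} of {len(rows)}")
```

So matching rows are already co-located by col2 bucket, but that layout is numbered differently from a hash over both join keys. Whether Spark is willing to reuse the col2-only layout for a join on (col1, col2) depends on the Spark version and planner settings, which is the part I can't figure out.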

Can anyone help me understand why this happens?

I tried setting various Spark properties related to bucketing, but it still shuffles the data.
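For reference, these are the kinds of settings usually looked at for bucketed joins. This is a sketch, not a confirmed fix: the list is an assumption about which properties are relevant, their availability depends on the Spark version (for example, `spark.sql.requireAllClusterKeysForCoPartition` only exists from Spark 3.3), and the helper `apply_settings` is a hypothetical convenience function.

```python
# Candidate settings to experiment with; values are strings as accepted by
# spark.conf.set. Whether any of them removes the shuffle is not confirmed.
CANDIDATE_SETTINGS = {
    # Use bucketing metadata when reading bucketed tables.
    "spark.sql.sources.bucketing.enabled": "true",
    # Stop the planner from automatically disabling bucketed scans (3.1+).
    "spark.sql.sources.bucketing.autoBucketedScan.enabled": "false",
    # Disable broadcast joins so a sort-merge join is planned.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    # Allow co-partitioning on a subset of the join keys (3.3+).
    "spark.sql.requireAllClusterKeysForCoPartition": "false",
}


def apply_settings(spark, settings=CANDIDATE_SETTINGS):
    """Apply each setting to a SparkSession-like object; return the keys set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return list(settings)
```

With a real session this would be called as `apply_settings(spark)` before running the query, followed by `explain()` on the query to check whether the Exchange nodes are still present.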

  • I think it would need to repartition it by the two columns you are joining on, since bucketing probably does not help it there. Commented Dec 1, 2023 at 10:36
