I have two large PySpark DataFrames, df1 and df2, containing GBs of data. The first has columns id1 and col1; the second has columns id2 and col2. The DataFrames have an equal number of rows, all values of id1 and id2 are unique, and each value of id1 corresponds to exactly one value of id2.
The first few entries of df1 and df2 are as follows:

df1:

```
id1 | col1
12  | john
23  | chris
35  | david
```

df2:

```
id2 | col2
23  | lewis
35  | boon
12  | cena
```

I need to join the two DataFrames on keys id1 and id2:

```
df = df1.join(df2, df1.id1 == df2.id2)
```

I am afraid this may suffer from shuffling. How can I optimize the join for this special case?