I have two large PySpark DataFrames, df1 and df2, containing GBs of data. The first has columns id1 and col1; the second has columns id2 and col2. The DataFrames have an equal number of rows, all values of id1 and id2 are unique, and each value of id1 corresponds to exactly one value of id2.
The first few entries of df1 and df2 are as follows:

df1:

```
id1 | col1
12  | john
23  | chris
35  | david
```

df2:

```
id2 | col2
23  | lewis
35  | boon
12  | cena
```

I need to join the two DataFrames on keys id1 and id2:

```
df = df1.join(df2, df1.id1 == df2.id2)
```

I am afraid this may suffer from shuffling. How can I optimize the join for this special case?