This happens because you are not applying partitionBy to a key-value pair RDD. Your RDD must consist of key-value pairs. Also, your key should be of integer type. I don't have sample data for your Hive table, so let's demonstrate this using the Hive table below:
I created the DataFrame below from the Hive table:
Now I want to partition my DataFrame and keep similar keys in one partition. So I converted my DataFrame to an RDD, because partitionBy for re-partitioning can only be applied to an RDD.
Hence, we need to convert our RDD into key-value pairs to use partitionBy.
Now you can see that the RDD has been converted to key-value pairs, so you can distribute your data across partitions according to the available keys.
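The conversion step can be sketched in plain Python. The rows, the column layout and the helper name `to_pair` are hypothetical stand-ins, since the real table is not shown here. The idea is only that each record becomes a `(key, value)` tuple with the key cast to an integer.

```python
# Hypothetical rows standing in for the records of the RDD;
# the first column plays the role of the key and arrives as a string.
rows = [("1", "alpha"), ("2", "beta"), ("3", "gamma"), ("1", "delta")]


def to_pair(row):
    """Return (key, value): the first column cast to int, plus the whole row."""
    return (int(row[0]), row)


pairs = [to_pair(row) for row in rows]
for key, value in pairs:
    print(key, value)
```

On a real RDD the same function would be applied with `df.rdd.map(to_pair)`, assuming the key sits in the first column.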
Using partitionBy on the key-value RDD now:
Let's take a look at the partitions. The data is grouped, and identical keys are now stored in the same partition. Two of the partitions are empty.
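To see why some partitions stay empty, here is a small pure-Python model of the assignment rule. It assumes the partition index is `partition_func(key) % num_partitions`, which is how PySpark's partitionBy chooses a partition. The sample pairs, the choice of five partitions and the helper `assign_partitions` are hypothetical, and Python's built-in `hash` stands in for Spark's default hash function (for small integers the two agree).

```python
def assign_partitions(pairs, num_partitions, partition_func=hash):
    """Place each (key, value) pair in partition partition_func(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


# Hypothetical key-value pairs with integer keys 1, 2 and 3.
pairs = [(1, "alpha"), (2, "beta"), (3, "gamma"), (1, "delta"), (3, "epsilon")]

partitions = assign_partitions(pairs, 5)
for index, contents in enumerate(partitions):
    print(index, contents)
```

With three distinct keys and five partitions, only three partitions receive data. On a real pair RDD the equivalent calls would be `rdd.partitionBy(5)` followed by `rdd.glom().collect()` to inspect the contents of each partition.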
Now let's say I want to partition my data in a custom way. So I created the function below to keep keys '1' and '3' in the same partition.
As you can see, keys 1 and 3 are now stored in one partition and the rest in the other.
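A custom partitioning function of this kind could look like the sketch below. The name `custom_partitioner` and the sample pairs are hypothetical, and the loop only imitates the `partition_func(key) % num_partitions` rule in plain Python rather than running on Spark.

```python
def custom_partitioner(key):
    """Send keys 1 and 3 to partition 0 and every other key to partition 1."""
    return 0 if key in (1, 3) else 1


# Hypothetical key-value pairs with integer keys.
pairs = [(1, "alpha"), (2, "beta"), (3, "gamma"), (1, "delta"), (3, "epsilon")]

num_partitions = 2
partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    # Same rule as a hash partitioner, but with the custom function supplying the index.
    partitions[custom_partitioner(key) % num_partitions].append((key, value))

for index, contents in enumerate(partitions):
    print(index, contents)
```

On a real pair RDD the function would be passed as the second argument: `rdd.partitionBy(2, custom_partitioner)`. The function must return an integer, because the result is taken modulo the number of partitions.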
I hope this helps. You can try to partitionBy your DataFrame. Make sure to convert it into key-value pairs and keep the key as an integer type.