I'm using Spark to process large files, and I have 12 partitions. I join rdd1 and rdd2, then select from the result (rdd3). My problem is that the last partition is much larger than the others: partitions 1 through 11 hold about 45,000 records each, but partition 12 holds about 9,100,000. Since 9,100,000 / 45,000 ≈ 203, I repartitioned rdd3 into 214 partitions (203 + 11), but the last partition is still too big. How can I balance the size of my partitions?
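This kind of skew usually means one (or a few) join keys account for most of the records, and a hash partitioner will always send every record with the same key to the same partition no matter how many partitions you create. A common workaround is "key salting": append a random suffix to the hot key on the large side, and replicate the hot key's rows on the small side once per suffix so the join still matches. Below is a minimal plain-Python sketch of the idea (the hot key name, salt count, and helper names are illustrative assumptions, not from the question; in Spark you would apply the same transformations with `map`/`flatMap` before the join):

```python
import random

NUM_SALTS = 16           # assumption: how many sub-keys to split the hot key into
HOT_KEY = "user_42"      # hypothetical key identified (by profiling) as skewed

def salt_key(key):
    """Large (skewed) side: spread the hot key across NUM_SALTS sub-keys."""
    if key == HOT_KEY:
        return (key, random.randrange(NUM_SALTS))
    return (key, 0)

def explode_key(key):
    """Small side: replicate the hot key once per salt so every salted
    record on the large side still finds its join partner."""
    if key == HOT_KEY:
        return [(key, s) for s in range(NUM_SALTS)]
    return [(key, 0)]
```

After the join you drop the salt component. The hot key's 9.1M records now hash as up to `NUM_SALTS` distinct sub-keys, so they can land in up to 16 different partitions instead of one.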
Should I write my own custom partitioner?
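A custom partitioner can help if the skew comes from many distinct keys colliding into the same partition, but note its limit: a partitioner is a deterministic function of the key, so it can never split a single hot key across partitions (that needs salting as above). As a sketch, PySpark's `rdd.partitionBy(numPartitions, partitionFunc=...)` accepts a plain function; here is a hypothetical skew-aware one (the hot-key set and partition counts are illustrative assumptions):

```python
NUM_PARTITIONS = 214
HOT_KEYS = {"key_a", "key_b"}   # keys found (by profiling) to be oversized

def skew_aware_partition(key):
    """Give each known hot key its own dedicated partition at the end of
    the range; hash all other keys over the remaining partitions."""
    hot = sorted(HOT_KEYS)
    if key in HOT_KEYS:
        return NUM_PARTITIONS - len(hot) + hot.index(key)
    return hash(key) % (NUM_PARTITIONS - len(hot))
```

In Scala you would instead subclass `org.apache.spark.Partitioner` and override `numPartitions` and `getPartition`. Either way, start by counting records per key on rdd3 to confirm whether the skew is one hot key (salt it) or a hash collision of many keys (a custom partitioner is enough).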
Tags: repartition, partitionBy