
I'm using Spark to process large files, and I have 12 partitions. I have rdd1 and rdd2; I join them, then do a select (rdd3). My problem is that the last partition is much bigger than the others: partitions 1 through 11 hold about 45,000 records each, but partition 12 holds about 9,100,000 records. So I computed 9,100,000 / 45,000 ≈ 203 and repartitioned rdd3 into 214 partitions (203 + 11), but the last partition is still too big. How can I balance the sizes of my partitions?
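The skew above can be reproduced with a small simulation of hash partitioning (plain Python as a stand-in for Spark's HashPartitioner; record counts are scaled down from the question by a factor of 1,000):

```python
from collections import Counter

def partition_for(key, num_partitions):
    # Mimics Spark's HashPartitioner: non-negative hash modulo partition count.
    return hash(key) % num_partitions

def partition_sizes(records, num_partitions):
    # Count how many (key, value) records each partition receives.
    return Counter(partition_for(k, num_partitions) for k, _ in records)

# 11 "normal" keys with ~45 records each, one hot key with ~9,100 records.
records = [(f"key{i}", None) for i in range(11) for _ in range(45)]
records += [("hot_key", None)] * 9100

for n in (12, 214):
    sizes = partition_sizes(records, n)
    # The hot key's partition dominates regardless of the partition count,
    # because every record with the same key hashes to the same partition.
    print(n, max(sizes.values()))
```

This is why raising the partition count alone does not help: all 9,100 hot-key records still land in one partition.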

Should I write my own custom partitioner?

  • Yes: repartition and partitionBy. Commented Jul 28, 2017 at 10:14
  • Can you detail what you tried and what feedback you got indicating they did not work? Commented Jul 28, 2017 at 10:15
  • Also please include the code, so we can see when in the process you are repartitioning. Commented Jul 28, 2017 at 10:38
  • Should I write my own custom partitioner? Commented Jul 28, 2017 at 10:44
  • Need more details on your data. Partitioning distributes records by key, so if the majority of your keys are the same, one partition will be large. Commented Jul 30, 2017 at 2:00

1 Answer


I have rdd1 and rdd2 i make a join between them

join is the most expensive operation in Spark. To join by key, Spark has to shuffle records so that equal keys end up in the same partition; if the keys are not uniformly distributed, you get exactly the behavior you describe. A custom partitioner won't help in that case: every record with the hot key must still land in a single partition.

I'd consider adjusting the logic, so it doesn't require a full join.
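If the join itself is unavoidable, one standard workaround is key salting: split the hot key into N sub-keys on the skewed side and replicate the matching rows on the other side, so the shuffle spreads the hot key over N partitions. A minimal plain-Python sketch of the idea (the names, the salt factor, and the toy data are illustrative, not from the question):

```python
import random
from collections import defaultdict

SALT = 20  # number of sub-keys for each hot key (illustrative choice)

def salt_left(records, hot_keys):
    # Skewed side: append a random salt to each hot key so its records
    # spread over up to SALT distinct shuffle keys.
    out = []
    for k, v in records:
        salt = random.randrange(SALT) if k in hot_keys else 0
        out.append(((k, salt), v))
    return out

def salt_right(records, hot_keys):
    # Other side: replicate each hot-key row once per salt value so every
    # salted left record still finds its match.
    out = []
    for k, v in records:
        if k in hot_keys:
            out.extend(((k, s), v) for s in range(SALT))
        else:
            out.append(((k, 0), v))
    return out

def hash_join(left, right):
    # Simple equi-join: index the right side by key, then probe with the left.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in index[k]]

left = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
right = [("hot", "x"), ("a", "y"), ("b", "z")]

plain = hash_join(left, right)
salted = hash_join(salt_left(left, {"hot"}), salt_right(right, {"hot"}))

# Salting preserves the join result: both produce 1002 matched rows,
# but the salted shuffle keys no longer pile up in one partition.
print(len(plain), len(salted))  # prints "1002 1002"
```

In Spark terms, the salting would happen in a map before the join and the salt column would be dropped afterwards; the trade-off is the replication of hot-key rows on the smaller side.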


