
I'm using Spark to process large files, and I have 12 partitions. I have rdd1 and rdd2; I join them, then do a select (rdd3). My problem is that the last partition is much bigger than the others: partitions 1 through 11 hold about 45,000 records each, but partition 12 holds about 9,100,000 records. So I computed 9,100,000 / 45,000 ≈ 203 and repartitioned rdd3 into 214 partitions (203 + 11), but the last partition is still too big. How can I balance the sizes of my partitions?
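The skew above can be reproduced with a small simulation of hash partitioning (plain Python as a stand-in for Spark's HashPartitioner; record counts are scaled down from the question by a factor of 1,000):

```python
from collections import Counter

def partition_for(key, num_partitions):
    # Mimics Spark's HashPartitioner: non-negative hash modulo partition count.
    return hash(key) % num_partitions

def partition_sizes(records, num_partitions):
    # Count how many (key, value) records each partition receives.
    return Counter(partition_for(k, num_partitions) for k, _ in records)

# 11 "normal" keys with ~45 records each, one hot key with ~9,100 records.
records = [(f"key{i}", None) for i in range(11) for _ in range(45)]
records += [("hot_key", None)] * 9100

for n in (12, 214):
    sizes = partition_sizes(records, n)
    # The hot key's partition dominates regardless of the partition count,
    # because every record with the same key hashes to the same partition.
    print(n, max(sizes.values()))
```

This is why raising the partition count alone does not help: all 9,100 hot-key records still land in one partition.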

Should I write my own custom partitioner?

  • Yes: repartition and partitionBy. Commented Jul 28, 2017 at 10:14
  • Can you detail what you tried and what feedback you got indicating they did not work? Commented Jul 28, 2017 at 10:15
  • Also please include the code, so we can see when in the process you are repartitioning. Commented Jul 28, 2017 at 10:38
  • Should I write my own custom partitioner? Commented Jul 28, 2017 at 10:44
  • Need more details on your data. Partitioning distributes records by key, so if the majority of your keys are the same, one partition will be large. Commented Jul 30, 2017 at 2:00

1 Answer


I have rdd1 and rdd2 i make a join between them

join is the most expensive operation in Spark. To join by key, Spark has to shuffle records so that equal keys end up in the same partition; if the keys are not uniformly distributed, you get exactly the behavior you describe. A custom partitioner won't help in that case: every record with the hot key must still land in a single partition.

I'd consider adjusting the logic, so it doesn't require a full join.
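If the join itself is unavoidable, one standard workaround is key salting: split the hot key into N sub-keys on the skewed side and replicate the matching rows on the other side, so the shuffle spreads the hot key over N partitions. A minimal plain-Python sketch of the idea (the names, the salt factor, and the toy data are illustrative, not from the question):

```python
import random
from collections import defaultdict

SALT = 20  # number of sub-keys for each hot key (illustrative choice)

def salt_left(records, hot_keys):
    # Skewed side: append a random salt to each hot key so its records
    # spread over up to SALT distinct shuffle keys.
    out = []
    for k, v in records:
        salt = random.randrange(SALT) if k in hot_keys else 0
        out.append(((k, salt), v))
    return out

def salt_right(records, hot_keys):
    # Other side: replicate each hot-key row once per salt value so every
    # salted left record still finds its match.
    out = []
    for k, v in records:
        if k in hot_keys:
            out.extend(((k, s), v) for s in range(SALT))
        else:
            out.append(((k, 0), v))
    return out

def hash_join(left, right):
    # Simple equi-join: index the right side by key, then probe with the left.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in index[k]]

left = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
right = [("hot", "x"), ("a", "y"), ("b", "z")]

plain = hash_join(left, right)
salted = hash_join(salt_left(left, {"hot"}), salt_right(right, {"hot"}))

# Salting preserves the join result: both produce 1002 matched rows,
# but the salted shuffle keys no longer pile up in one partition.
print(len(plain), len(salted))  # prints "1002 1002"
```

In Spark terms, the salting would happen in a map before the join and the salt column would be dropped afterwards; the trade-off is the replication of hot-key rows on the smaller side.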


