In many posts the statement below appears in some form or another, usually in answer to a question about shuffling or partitioning caused by a JOIN, an aggregation, etc.:
... In general, whenever you do a Spark SQL aggregation or join that shuffles data, the number of resulting partitions is 200. This is set by spark.sql.shuffle.partitions. ...
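To make the quoted claim concrete, here is a minimal sketch (assuming a spark-shell / notebook session with a SparkSession named spark; the data is just a synthetic range):

    import org.apache.spark.sql.functions.col

    // Minimal sketch, assuming a SparkSession named `spark` (e.g. in spark-shell).
    spark.conf.get("spark.sql.shuffle.partitions")      // "200" unless overridden

    val df = spark.range(1000000).toDF("k")
    val aggregated = df.groupBy(col("k") % 10).count()  // groupBy triggers a shuffle
    aggregated.rdd.getNumPartitions                     // 200 - the configured default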
So, my question is:
- Do we mean that if we have set the partitioning at 765 for a DF, for example,
- the processing occurs against 765 partitions, but the output is then coalesced / re-partitioned to the standard 200 (which is what the word "resulting" suggests)?
- Or does the processing use 200 partitions, i.e. the data is coalesced / re-partitioned to 200 partitions before the JOIN or aggregation runs?
I ask because I have never seen a clear answer on this.
I did the following test:
    // genned a DS of some 20M short rows
    df0.count

    val ds1 = df0.repartition(765)
    ds1.count
    val ds2 = df0.repartition(765)
    ds2.count

    sqlContext.setConf("spark.sql.shuffle.partitions", "765")
    // The line above was not included on the 1st run and was included on the 2nd run.

    ds1.rdd.partitions.size
    ds2.rdd.partitions.size

    val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
    joined.rdd.partitions.size
    joined.count
    joined.rdd.partitions.size

On the 1st test - without sqlContext.setConf("spark.sql.shuffle.partitions", "765") - the processing and the resulting number of partitions came out at 200. Even though SO post 45704156 states that it may not apply to DFs, this is a DS.
On the 2nd test - with sqlContext.setConf("spark.sql.shuffle.partitions", "765") defined - the processing and the resulting number of partitions came out at 765. Again, even though SO post 45704156 states that it may not apply to DFs, this is a DS.
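For what it is worth, the same test can be expressed against the SparkSession API instead of sqlContext (a sketch under the assumption that a SparkSession named spark and the same df0 with its time_asc column are available):

    // Sketch assuming a SparkSession named `spark` and the same df0 / time_asc as above.
    spark.conf.get("spark.sql.shuffle.partitions")          // "200" unless overridden
    spark.conf.set("spark.sql.shuffle.partitions", "765")   // comment out for the 1st run

    val ds1 = df0.repartition(765)
    val ds2 = df0.repartition(765)

    val joined = ds1.join(ds2, ds1("time_asc") === ds2("time_asc"), "outer")
    joined.count                   // force the shuffle to actually run
    joined.rdd.getNumPartitions    // 200 without the set, 765 with it - matching the two runs above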