In a parquet data lake partitioned by year and month, with spark.default.parallelism set to, e.g., 4, let's say I want to create a DataFrame comprising months 11-12 of 2017 and months 1-3 of 2018 from two sources, A and B.
```python
df = spark.read.parquet(
    "A.parquet/_YEAR={2017}/_MONTH={11,12}",
    "A.parquet/_YEAR={2018}/_MONTH={1,2,3}",
    "B.parquet/_YEAR={2017}/_MONTH={11,12}",
    "B.parquet/_YEAR={2018}/_MONTH={1,2,3}",
)
```

If I check the number of partitions, Spark has used spark.default.parallelism as the default:
```python
df.rdd.getNumPartitions()
Out[4]: 4
```

Taking into account that after creating df I need to perform join and groupBy operations over each period, and that the data is more or less evenly distributed across each one (around 10 million rows per period):
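For reference, the per-period work is along the lines of the sketch below (a rough sketch only: `other_df` and the `key`/`value` columns are placeholders, not my real schema):

```python
from pyspark.sql import functions as F

# Placeholder follow-up operations: join on the period plus a business key,
# then aggregate per (_YEAR, _MONTH) period. "other_df", "key" and "value"
# are illustrative names, not the actual schema.
joined = df.join(other_df, on=["_YEAR", "_MONTH", "key"])
per_period = joined.groupBy("_YEAR", "_MONTH").agg(F.sum("value").alias("total"))
```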
Question
- Will a repartition improve the performance of my subsequent operations?
- If so, given that I have 10 distinct periods (5 in each of A and B), should I repartition by the number of periods and explicitly reference the columns to repartition on, i.e. `df.repartition(10, '_MONTH', '_YEAR')` (see the sketch below)?
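In other words, the option I am weighing looks roughly like this (whether it actually helps is the question):

```python
# Repartition explicitly on the period columns before the joins/aggregations.
# 10 = number of distinct (_YEAR, _MONTH) periods present in the data.
df = df.repartition(10, "_MONTH", "_YEAR")
df.rdd.getNumPartitions()  # now reports 10 instead of the default 4
```

versus leaving the 4 partitions that spark.default.parallelism gave me.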