I want to write a Spark DataFrame to Parquet, but rather than partitioning it with partitionBy I want to specify the number of partitions (or the size of each partition). Is there an easy way to do that in PySpark?
1 Answer
If all you care about is the number of partitions, the method is exactly the same as for any other output format - you can repartition the DataFrame to the given number of partitions and use the DataFrameWriter afterwards:
df.repartition(n).write.parquet(some_path)
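For context, a minimal end-to-end sketch of that approach, assuming a local SparkSession; the stand-in DataFrame, the partition count n, and the output path are placeholders, not anything from the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-write").getOrCreate()

df = spark.range(0, 1_000_000)   # stand-in DataFrame
n = 8                            # desired number of output partitions (and part files)

(df.repartition(n)               # full shuffle into exactly n partitions
   .write
   .mode("overwrite")
   .parquet("/tmp/parquet_repartitioned"))

Each partition is written as (roughly) one part file, so n directly controls the number of output files.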
2 Comments

MYjx: Thanks! But it seems repartition is very costly. I tried coalesce, but the job actually failed. Is there any requirement on numPartitions in coalesce? Should coalesce be less expensive than repartition?

zero323: Only if the change is relatively small. Otherwise it has to move data, so the only advantage is the lack of a full shuffle. On the other hand, it is less likely to produce a uniform distribution.
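To illustrate that trade-off, a small sketch of the coalesce alternative, assuming the DataFrame starts with more partitions than the target; the data, counts, and path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start from a deliberately over-partitioned placeholder DataFrame.
df = spark.range(0, 1_000_000).repartition(200)

# coalesce only merges existing partitions, so reducing 200 -> 8 avoids a full
# shuffle; the resulting partitions (and output files) can be uneven in size,
# which is the trade-off described in the comment above.
df.coalesce(8).write.mode("overwrite").parquet("/tmp/parquet_coalesced")

In short, coalesce is cheaper only when shrinking the partition count; it cannot increase parallelism and gives less control over partition balance than repartition.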