
I want to write a Spark DataFrame to Parquet, but rather than specifying partitionBy, I want to specify numPartitions or the size of each partition. Is there an easy way to do that in PySpark?

1 Answer

If all you care about is the number of partitions, the method is exactly the same as for any other output format: repartition the DataFrame to the given number of partitions and use the DataFrameWriter afterwards:

df.repartition(n).write.parquet(some_path) 
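For context, here is a minimal runnable sketch of the same idea; the DataFrame contents, partition count, and output path are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# Hypothetical DataFrame; replace with your own data.
df = spark.range(0, 1_000_000)

# Repartition to exactly 8 partitions; the write then produces 8 Parquet part files.
df.repartition(8).write.mode("overwrite").parquet("/tmp/example_output")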

2 Comments

Thanks! But it seems repartition is very costly. I tried coalesce, but the job actually failed. Is there any requirement on numPartitions in coalesce? Shouldn't coalesce be less expensive than repartition?
Only if the change is relatively small. Otherwise it still has to move data, so the only advantage is avoiding a full shuffle. On the other hand, it is less likely to produce a uniform distribution.
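To illustrate the trade-off discussed in the comments, here is a short hedged sketch; the counts and paths are hypothetical. Note that coalesce can only reduce the partition count, which may explain unexpected behavior when asking it to increase partitions:

# coalesce(n) merges existing partitions without a full shuffle, so it can
# only reduce the partition count; asking for more partitions than currently
# exist leaves the count unchanged rather than raising an error.
df.coalesce(4).write.mode("overwrite").parquet("/tmp/coalesced_output")

# repartition(n) performs a full shuffle, so it can increase or decrease the
# partition count and tends to balance partition sizes more evenly.
df.repartition(16).write.mode("overwrite").parquet("/tmp/repartitioned_output")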
