I want to write a Spark DataFrame to Parquet, but rather than partitioning it with partitionBy I want to specify the number of partitions (or the size of each partition). Is there an easy way to do that in PySpark?
1 Answer
If all you care about is the number of partitions, the method is exactly the same as for any other output format - you can repartition the DataFrame to the given number of partitions and use the DataFrameWriter afterwards:
df.repartition(n).write.parquet(some_path)
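For context, a minimal end-to-end sketch of that approach, assuming a local SparkSession; the stand-in DataFrame, the partition count n, and the output path are placeholders, not anything from the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-write").getOrCreate()

df = spark.range(0, 1_000_000)   # stand-in DataFrame
n = 8                            # desired number of output partitions (and part files)

(df.repartition(n)               # full shuffle into exactly n partitions
   .write
   .mode("overwrite")
   .parquet("/tmp/parquet_repartitioned"))

Each partition is written as (roughly) one part file, so n directly controls the number of output files.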
2 Comments

MYjx: Thanks! But it seems repartition is very costly. I tried coalesce, but the job actually failed. Is there any requirement on numPartitions in coalesce? Should coalesce be less expensive than repartition?

zero323: Only if the change is relatively small. Otherwise it has to move data, so the only advantage is the lack of a full shuffle. On the other hand, it is less likely to produce a uniform distribution.
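To illustrate that trade-off, a small sketch of the coalesce alternative, assuming the DataFrame starts with more partitions than the target; the data, counts, and path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start from a deliberately over-partitioned placeholder DataFrame.
df = spark.range(0, 1_000_000).repartition(200)

# coalesce only merges existing partitions, so reducing 200 -> 8 avoids a full
# shuffle; the resulting partitions (and output files) can be uneven in size,
# which is the trade-off described in the comment above.
df.coalesce(8).write.mode("overwrite").parquet("/tmp/parquet_coalesced")

In short, coalesce is cheaper only when shrinking the partition count; it cannot increase parallelism and gives less control over partition balance than repartition.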