
I have a parquet folder partitioned by sensor_name, and each sensor has the same number of readings. When I read it using select, my dataframe looks like this:

 sensor_name | reading
-------------|---------
 a           | 0.0
 b           | 2.0
 c           | 1.0
 a           | 0.0
 b           | 0.0
 c           | 1.0
 ...

I want to apply a transformation to each sensor's readings (say, multiply by 10) and then store the result as a parquet folder with the same partitioning, i.e. partitioned by sensor_name.

When I ran the code below, I realized that Spark does its own partitioning:

df.write.format("parquet").mode("overwrite").save("path") 

So I changed it as shown below to partition by sensor_name, and it was tremendously slow:

df.write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") 

Then I tried repartitioning first, which was better than before but still slow:

df.repartition("sensor_name").write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") 

Is there a way to tell Spark not to repartition it and honor my partition while doing select?


1 Answer


Is there a way to tell Spark not to repartition it and honor my partition while doing select?

No, there is not. If you need the data physically partitioned on disk, you have to use partitionBy. The only alternative is to read each partition's data individually, enrich it, and write it to that partition's directory. For that you would need to combine plain Python code (I would do it in Scala, though) with the PySpark API.
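A minimal sketch of that per-partition alternative, not a definitive implementation. It assumes the data sits on a local filesystem that `os.listdir` can see (HDFS or S3 would need a different listing call), that the directories follow the Hive-style `sensor_name=<value>` layout, and that the output root differs from the input root. The helper names `list_partition_values`, `partition_dir` and `transform_partitions` are hypothetical, and the multiply-by-10 step is the example from the question.

```python
import os


def list_partition_values(root, column="sensor_name"):
    """Return the values of `column` found as Hive-style subdirectories of `root`."""
    prefix = column + "="
    return sorted(
        name[len(prefix):]
        for name in os.listdir(root)
        if name.startswith(prefix) and os.path.isdir(os.path.join(root, name))
    )


def partition_dir(root, value, column="sensor_name"):
    """Build the directory of one partition, e.g. root/sensor_name=a."""
    return os.path.join(root, f"{column}={value}")


def transform_partitions(spark, src_root, dst_root, column="sensor_name"):
    """Read, transform, and write one partition directory at a time."""
    # Imported here so the path helpers above work without PySpark installed.
    from pyspark.sql import functions as F

    for value in list_partition_values(src_root, column):
        # Reading the partition directory directly means the partition column
        # is not part of the data; it lives only in the directory name.
        part = spark.read.parquet(partition_dir(src_root, value, column))
        # Example transformation from the question: multiply each reading by 10.
        part = part.withColumn("reading", F.col("reading") * 10)
        # Write straight into the matching output directory, so no partitionBy
        # is needed for this write.
        part.write.mode("overwrite").parquet(partition_dir(dst_root, value, column))
```

Called as `transform_partitions(spark, "in_path", "out_path")`, this handles the sensors one after another, so it gives up parallelism across sensors in exchange for skipping partitionBy. Reading `out_path` as a whole afterwards should pick up sensor_name again through partition discovery.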

The code you are using is the most efficient option, and Spark will optimize it. You may be seeing a performance bottleneck if you are running in standalone mode or if you have a join operation that involves a shuffle.


1 Comment

Thanks Betta. I am trying to transform the sensor values such that my dataframe ends up with 10000 columns. Think of something like multiplying the value by cos(1), cos(2), and so on. I am getting an out-of-memory error on the executors. However, when I tried the same thing with pandas for one sensor, it was quick and there was no out-of-memory error. Hence I thought honouring the partitioning might help. Thanks again for the answer.
