I have a parquet folder partitioned by sensor_name, and each sensor has the same number of readings. When I read it using select, my dataframe looks like this:
```
sensor_name | reading
------------|--------
a           | 0.0
b           | 2.0
c           | 1.0
a           | 0.0
b           | 0.0
c           | 1.0
...
```

I want to apply some transformation to each sensor (say, multiply by 10) and then store the result as a parquet folder with the same partitioning, i.e. partitioned by sensor_name.
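For context, this is roughly how I read and transform the data (the path, app name, and the multiply-by-10 step are illustrative placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-transform").getOrCreate()

# The input folder contains sensor_name=<value> subdirectories,
# so Spark recovers sensor_name as a column on read.
df = spark.read.parquet("/data/sensors").select("sensor_name", "reading")

# The example transformation: scale every reading by 10.
df = df.withColumn("reading", F.col("reading") * 10)
```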
When I ran the line below, I realized Spark does its own partitioning:
df.write.format("parquet").mode("overwrite").save("path") So, I changed like below to do partitioning and it was tremendously slow,
df.write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") Then I tried to repartition and it was better than before but still slow,
df.repartition("sensor_name").write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") Is there a way to tell Spark not to repartition it and honor my partition while doing select?