I have a parquet folder partitioned by sensor_name, and each sensor has the same number of readings. When I read it using select, my dataframe looks like this:
```
sensor_name | reading
------------|--------
a           | 0.0
b           | 2.0
c           | 1.0
a           | 0.0
b           | 0.0
c           | 1.0
...
```

I want to apply some transformation to each sensor (say, multiply by 10) and then store the result as a parquet folder with the same partitioning, i.e. partitioned by sensor_name.
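For context, this is roughly how I read and transform the data (the path, app name, and the multiply-by-10 step are illustrative placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-transform").getOrCreate()

# The input folder contains sensor_name=<value> subdirectories,
# so Spark recovers sensor_name as a column on read.
df = spark.read.parquet("/data/sensors").select("sensor_name", "reading")

# The example transformation: scale every reading by 10.
df = df.withColumn("reading", F.col("reading") * 10)
```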
When I ran the line below, I realized Spark does its own partitioning:
df.write.format("parquet").mode("overwrite").save("path") So, I changed like below to do partitioning and it was tremendously slow,
df.write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") Then I tried to repartition and it was better than before but still slow,
df.repartition("sensor_name").write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path") Is there a way to tell Spark not to repartition it and honor my partition while doing select?