JDBC to Spark Dataframe - How to ensure even partitioning?

I am new to Spark, and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.

I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.

  • The documentation seems to indicate that these fields are optional. What happens if I don't provide them?
  • How does Spark know how to partition the queries? How efficient will that be?
  • If I DO specify these options, how do I ensure that the partition sizes are roughly even if the partitionColumn is not evenly distributed?
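For context on what I understand these options to do when all four are set: Spark doesn't look at the data at all, it just slices [lowerBound, upperBound] into numPartitions equal strides and issues one WHERE-clause query per stride. A plain-Python sketch of that strategy (this mirrors the documented behavior in spirit; the exact boundary arithmetic may differ between Spark versions):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    # Sketch of the documented strategy: slice [lowerBound, upperBound]
    # into equal strides, one WHERE clause per partition. Not the actual
    # Spark source, just an illustration of the idea.
    stride = upper // num_partitions - lower // num_partitions
    predicates, current = [], lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up ids below lowerBound (and NULLs).
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition picks up everything from here up, past upperBound.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

preds = jdbc_partition_predicates("id", 1, 2_000_000, 4)
# → four range predicates, each covering a stride of 500,000 ids
```

Note that, as I read the docs, lowerBound and upperBound only shape the strides; they don't filter rows, so ids outside the range still land in the first or last partition.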

Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000.
However, because the user selects to process some really old data along with some really new data, with nothing in the middle, most of the data has ID values either under 100,000 or over 1,900,000.
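The skew I'm worried about is easy to check with a quick simulation: even strides over [1, 2,000,000], as numPartitions=20 would slice them, against a made-up bimodal set of ids:

```python
import bisect

# Hypothetical skewed ids: "really old" rows below 100,000 and
# "really new" rows above 1,900,000, nothing in between.
ids = list(range(1, 100_000, 10)) + list(range(1_900_001, 2_000_000, 10))

# 20 even strides over [1, 2_000_000], the way numPartitions=20 would slice it.
stride = 2_000_000 // 20
boundaries = [1 + stride * i for i in range(1, 20)]

counts = [0] * 20
for i in ids:
    counts[bisect.bisect_right(boundaries, i)] += 1

# counts[0] and counts[19] hold all 20,000 rows; the middle 18 partitions are empty.
```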

  • Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
  • If so, is there a way to prevent this?
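One workaround I'm considering, if the answer is yes: skip lowerBound/upperBound entirely and pass an explicit predicates list to DataFrameReader.jdbc, with hand-built ranges that follow the actual data distribution. A sketch (the region boundaries below are hypothetical, matching the example above):

```python
def even_ranges(column, lo, hi, parts):
    # Split [lo, hi) into `parts` contiguous, non-overlapping WHERE clauses.
    step = -(-(hi - lo) // parts)  # ceiling division so the slices cover all of [lo, hi)
    clauses = []
    for i in range(parts):
        a = lo + step * i
        b = min(lo + step * (i + 1), hi)
        clauses.append(f"{column} >= {a} AND {column} < {b}")
    return clauses

# 10 slices over the dense "old" region, 10 over the dense "new" region,
# and none over the empty middle. Region boundaries here are made up.
preds = even_ranges("id", 1, 100_000, 10) + even_ranges("id", 1_900_000, 2_000_001, 10)

# Each predicate becomes one partition, so the load would be spread evenly:
# df = spark.read.jdbc(url, "my_table", predicates=preds, properties=props)
```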


Source Link
JoeMjr2
