JDBC to Spark Dataframe - How to ensure even partitioning?

I am new to Spark, and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.

I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.

  • The documentation seems to indicate that these fields are optional. What happens if I don't provide them?
  • How does Spark know how to partition the queries? How efficient will that be?
  • If I DO specify these options, how do I ensure that the partition sizes are roughly even if the partitionColumn is not evenly distributed?
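For context on what I understand these options to do when all four are set: Spark doesn't look at the data at all, it just slices [lowerBound, upperBound] into numPartitions equal strides and issues one WHERE-clause query per stride. A plain-Python sketch of that strategy (this mirrors the documented behavior in spirit; the exact boundary arithmetic may differ between Spark versions):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    # Sketch of the documented strategy: slice [lowerBound, upperBound]
    # into equal strides, one WHERE clause per partition. Not the actual
    # Spark source, just an illustration of the idea.
    stride = upper // num_partitions - lower // num_partitions
    predicates, current = [], lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up ids below lowerBound (and NULLs).
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition picks up everything from here up, past upperBound.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

preds = jdbc_partition_predicates("id", 1, 2_000_000, 4)
# → four range predicates, each covering a stride of 500,000 ids
```

Note that, as I read the docs, lowerBound and upperBound only shape the strides; they don't filter rows, so ids outside the range still land in the first or last partition.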

Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000.
However, because the user selects to process some really old data along with some really new data, with nothing in the middle, most of the data has ID values either under 100,000 or over 1,900,000.
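The skew I'm worried about is easy to check with a quick simulation: even strides over [1, 2,000,000], as numPartitions=20 would slice them, against a made-up bimodal set of ids:

```python
import bisect

# Hypothetical skewed ids: "really old" rows below 100,000 and
# "really new" rows above 1,900,000, nothing in between.
ids = list(range(1, 100_000, 10)) + list(range(1_900_001, 2_000_000, 10))

# 20 even strides over [1, 2_000_000], the way numPartitions=20 would slice it.
stride = 2_000_000 // 20
boundaries = [1 + stride * i for i in range(1, 20)]

counts = [0] * 20
for i in ids:
    counts[bisect.bisect_right(boundaries, i)] += 1

# counts[0] and counts[19] hold all 20,000 rows; the middle 18 partitions are empty.
```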

  • Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
  • If so, is there a way to prevent this?
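One workaround I'm considering, if the answer is yes: skip lowerBound/upperBound entirely and pass an explicit predicates list to DataFrameReader.jdbc, with hand-built ranges that follow the actual data distribution. A sketch (the region boundaries below are hypothetical, matching the example above):

```python
def even_ranges(column, lo, hi, parts):
    # Split [lo, hi) into `parts` contiguous, non-overlapping WHERE clauses.
    step = -(-(hi - lo) // parts)  # ceiling division so the slices cover all of [lo, hi)
    clauses = []
    for i in range(parts):
        a = lo + step * i
        b = min(lo + step * (i + 1), hi)
        clauses.append(f"{column} >= {a} AND {column} < {b}")
    return clauses

# 10 slices over the dense "old" region, 10 over the dense "new" region,
# and none over the empty middle. Region boundaries here are made up.
preds = even_ranges("id", 1, 100_000, 10) + even_ranges("id", 1_900_000, 2_000_001, 10)

# Each predicate becomes one partition, so the load would be spread evenly:
# df = spark.read.jdbc(url, "my_table", predicates=preds, properties=props)
```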


Source Link
JoeMjr2
