
I am running Spark in cluster mode and reading data from an RDBMS via JDBC.

As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:

  • partitionColumn
  • lowerBound
  • upperBound
  • numPartitions

These are optional parameters.
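For context, here is a minimal sketch of a read that does specify all four options (the URL, table name, and column are placeholders, not from the original post). Spark splits the range `[lowerBound, upperBound)` into `numPartitions` strides and issues one query per stride:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// With numPartitions = 4 over [0, 1000), Spark issues 4 parallel queries
// over the partition column, roughly:
//   WHERE id <  250               -- first partition also catches rows below lowerBound
//   WHERE id >= 250 AND id < 500
//   WHERE id >= 500 AND id < 750
//   WHERE id >= 750               -- last partition also catches rows above upperBound
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")  // placeholder URL
  .option("dbtable", "my_table")                    // placeholder table
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")  // must be numeric, date, or timestamp
  .option("lowerBound", "0")        // bounds only shape the stride;
  .option("upperBound", "1000")     // they do not filter any rows
  .option("numPartitions", "4")
  .load()
```

Note that `lowerBound` and `upperBound` do not filter rows: the first and last partitions are open-ended, so all rows are still read.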

What happens if I don't specify these?

  • Does only one worker read the whole table?
  • If it still reads in parallel, how does it partition the data?


If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates}, Spark will use a single executor and create a single non-empty partition. All data will be processed in a single transaction, and reads will be neither distributed nor parallelized.
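The `predicates` alternative mentioned above is a sketch like the following, assuming the same SparkSession as earlier (the column and values are hypothetical); each predicate string becomes one partition and one parallel query:

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// One WHERE clause per partition; together they should cover the table
// without overlap, or rows will be missed or duplicated.
val predicates = Array(
  "region = 'EMEA'",
  "region = 'APAC'",
  "region = 'AMER'"
)

// Three partitions, one per predicate, read in parallel.
val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", "my_table", predicates, props)
```

This form is useful when the table has no convenient numeric or date column to stride over.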


1 Comment

What about writing through JDBC, e.g. `df.write.mode(SaveMode.Append).jdbc("<other database url>", "<same table name>", <some DbProperties>)`?
