I want to read data from a database using Spark's JDBC API. I will be using 200 executors to read the data.

My question is: if I provide 200 executors, will Spark open 200 connections to the centralized (JDBC) database, or will the driver fetch all the data over a single connection?

1 Answer

When you establish connectivity with spark.read.jdbc(...), you can specify the numPartitions parameter. It sets an upper limit on how many parallel connections can be created. The Spark documentation describes it as follows:

The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

However, by default the data is read into a single partition over a single connection, which usually doesn't fully utilize your SQL database.
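For reference, a sketch of a parallel read with all the partitioning options set (the URL, table, column, and bounds below are hypothetical placeholders for your own database):

```scala
// Hypothetical connection details; replace with your own database.
val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "orders")            // assumed table name
  .option("user", "spark_user")
  .option("password", "secret")
  .option("partitionColumn", "order_id")  // must be numeric, date, or timestamp
  .option("lowerBound", "1")              // assumed min of order_id
  .option("upperBound", "1000000")        // assumed max of order_id
  .option("numPartitions", "200")         // at most 200 parallel JDBC connections
  .load()
```

With this configuration each executor task opens its own connection and reads one range of order_id, so up to 200 connections can hit the database concurrently. Without partitionColumn, lowerBound, and upperBound, the whole table is read through a single connection regardless of the executor count.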

4 Comments

In isolation numPartitions has no effect. It is used only if combined with other properties.
@user8371915 of course, but nevertheless this parameter controls parallelism.
Only if it is combined with the bounds and a partition column, and the predicates argument is not provided; otherwise it is ignored.
It works for limiting the number of connections used for writing, though. In that case numPartitions must be specified as an option on the DataFrameWriter (like dataset.write().option("numPartitions", 50)), not on the DataFrameReader. Spark then limits the number of connections used for writing with a "coalesce" operation.
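The write-side behavior mentioned above can be sketched like this (URL and table name are hypothetical; df stands for any existing DataFrame):

```scala
// Capping write connections: when the DataFrame has more partitions than
// numPartitions, Spark coalesces it down before writing, so at most 50
// concurrent JDBC connections are opened here.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // assumed URL
  .option("dbtable", "orders_copy")                     // assumed target table
  .option("user", "spark_user")
  .option("password", "secret")
  .option("numPartitions", "50")
  .mode("append")
  .save()
```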
