I want to read data from a database using Spark's JDBC API. I will be using 200 executors to read the data.

My question is: if I provide 200 executors, will Spark open 200 connections to the centralized (JDBC) database, or will the driver fetch all the data over a single connection?

1 Answer

When you establish connectivity with spark.read.jdbc(...), you can specify the numPartitions parameter. It sets an upper limit on how many parallel connections can be created. The Spark documentation describes it as follows:

The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

However, by default the data is read into a single partition over a single connection, which usually doesn't fully utilize your SQL database.
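For reference, a sketch of a parallel read with all the partitioning options set (the URL, table, column, and bounds below are hypothetical placeholders for your own database):

```scala
// Hypothetical connection details; replace with your own database.
val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "orders")            // assumed table name
  .option("user", "spark_user")
  .option("password", "secret")
  .option("partitionColumn", "order_id")  // must be numeric, date, or timestamp
  .option("lowerBound", "1")              // assumed min of order_id
  .option("upperBound", "1000000")        // assumed max of order_id
  .option("numPartitions", "200")         // at most 200 parallel JDBC connections
  .load()
```

With this configuration each executor task opens its own connection and reads one range of order_id, so up to 200 connections can hit the database concurrently. Without partitionColumn, lowerBound, and upperBound, the whole table is read through a single connection regardless of the executor count.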

4 Comments

In isolation numPartitions has no effect. It is used only if combined with other properties.
@user8371915 of course, but nevertheless this parameter controls parallelism.
Only if it is combined with the bounds and a partition column, and the predicates argument is not provided; otherwise it is ignored.
It works for limiting the number of connections used for writing, though. In that case numPartitions must be specified as an option on the DataFrameWriter (like dataset.write().option("numPartitions", 50)), not on the DataFrameReader. Spark then limits the number of connections used for writing with a "coalesce" operation.
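The write-side behavior mentioned above can be sketched like this (URL and table name are hypothetical; df stands for any existing DataFrame):

```scala
// Capping write connections: when the DataFrame has more partitions than
// numPartitions, Spark coalesces it down before writing, so at most 50
// concurrent JDBC connections are opened here.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // assumed URL
  .option("dbtable", "orders_copy")                     // assumed target table
  .option("user", "spark_user")
  .option("password", "secret")
  .option("numPartitions", "50")
  .mode("append")
  .save()
```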
