
I am running a Spark analytics application and reading a whole MSSQL Server table directly using Spark JDBC. The table has more than 30M records but no primary key column or integer column. Since the table lacks such a column I cannot use the partitionColumn option, so reading the table takes too much time.

val datasource = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;database=mydb")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "dbo.table")
  .option("user", "myuser")
  .option("password", "password")
  .option("useSSL", "false")
  .load()

Is there any way to improve the performance in such a case and use parallelism while reading data from relational database sources? (The source could be Oracle, MSSQL Server, MySQL, or DB2.)

  • Too vague. Joining? Only wanting incremental or all data per table? Commented Sep 26, 2019 at 8:29
  • Possible duplicate of How to optimize partitioning when migrating data from JDBC source? Commented Sep 26, 2019 at 8:32
  • I have updated the question. I will definitely have to read the entire table since we don't have any change data identifier column. Commented Sep 26, 2019 at 8:33
  • @user10958683 - not really; my table doesn't have any primary key, change data identifier column, or integer column. Commented Sep 26, 2019 at 8:37
  • Use Sqoop and do some analysis to determine a suitable split column. Commented Sep 26, 2019 at 17:40

1 Answer


The only way is to write a query that returns the data already partitioned and point partitionColumn at the newly generated column, but I don't know whether this will really speed up your ingestion.

For example, in pseudo-SQL:

val myReadQuery = "SELECT *, (rowid % 5) AS part FROM table"

And then:

val datasource = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;database=mydb")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", s"($myReadQuery) as t")
  .option("user", "myuser")
  .option("password", "password")
  .option("useSSL", "false")
  .option("numPartitions", 5)
  .option("partitionColumn", "part")
  .option("lowerBound", 1)
  .option("upperBound", 5)
  .load()

But as I already said, I am not really sure this will improve your ingestion, because it causes 5 parallel queries like these:

SELECT * FROM (SELECT *, (rowid % 5) AS part FROM table) WHERE part >= 0 AND part < 1
SELECT * FROM (SELECT *, (rowid % 5) AS part FROM table) WHERE part >= 1 AND part < 2
SELECT * FROM (SELECT *, (rowid % 5) AS part FROM table) WHERE part >= 2 AND part < 3
SELECT * FROM (SELECT *, (rowid % 5) AS part FROM table) WHERE part >= 3 AND part < 4
SELECT * FROM (SELECT *, (rowid % 5) AS part FROM table) WHERE part >= 4 AND part < 5

But I think that if your table has an index, you can use the indexed column to extract an integer which, with the mod operation, can split the read operation and at the same time speed up the read query.
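On SQL Server specifically there is no rowid pseudo-column, so the integer has to be synthesized some other way. A minimal sketch, assuming an indexed column named created_at (the column name is an assumption, not from the question):

// Sketch: ROW_NUMBER() over an assumed indexed column "created_at"
// synthesizes the integer that the modulo trick above needs on SQL Server.
val myReadQuery =
  """SELECT t.*,
    |       (ROW_NUMBER() OVER (ORDER BY created_at)) % 5 AS part
    |FROM dbo.table t""".stripMargin

This query can then be read exactly as in the snippet above. Note that lowerBound and upperBound do not filter any rows; they only set the partition boundaries, so rows with part = 0 still land in the first partition.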


2 Comments

I believe partitionColumn should be either a primary key column or an integer column. Could you please explain the usage of numPartitions, lowerBound, and upperBound in my case?
Yes, partitionColumn must be of integer type, but you can create an integer field that works as a partitioner. numPartitions is the number of read partitions; lowerBound and upperBound specify the minimum and maximum value of the partitionColumn, so Spark can internally build the queries as I wrote before and run them in parallel without duplicates.
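As a side note (not from the answer above, just a sketch): when there is no usable numeric column at all, the read.jdbc overload that accepts explicit predicates lets you hand Spark one WHERE clause per partition, for example over a hash of an existing column. Here some_text_col is an assumed column name:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "password")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// One WHERE clause per partition. ABS(CHECKSUM(...)) is SQL Server's way to
// hash a column into an integer; "some_text_col" is an assumed column name.
val predicates = (0 until 5)
  .map(i => s"ABS(CHECKSUM(some_text_col)) % 5 = $i")
  .toArray

val datasource = spark.read.jdbc(
  "jdbc:sqlserver://host:1433;database=mydb",
  "dbo.table",
  predicates,
  props
)

Each predicate becomes one JDBC partition, so this reads with 5-way parallelism without needing lowerBound and upperBound at all.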
