
I have a query which joins 4 tables, and I used query pushdown to read it into a DataFrame.

val df = spark.read.format("jdbc"). option("url", "jdbc:mysql://ip/dbname"). option("driver", "com.mysql.jdbc.Driver"). option("user", "username"). option("password", "password") .option("dbtable",s"($query) as temptable") .load() 

The numbers of records in the individual tables are 430, 350, 64, and 2354 respectively. It takes 12.784 sec to load the data and 2.119 sec to create the SparkSession.

Then I count the result data as follows:

    val count = df.count()
    println(s"count $count")

The total execution time is then 25.806 sec, and the result contains only 430 records.

When I run the same query in SQL Workbench it takes only a few seconds to execute completely. I also tried cache() after load(), but it takes the same time. So how can I execute it much faster than this?
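For reference, the cache attempt mentioned above would look roughly like this (a minimal sketch; the cache is only materialized on the first action, so the initial count still pays the full JDBC read cost):

    val cachedDf = df.cache()              // marks the DataFrame for caching; nothing is read yet
    val cachedCount = cachedDf.count()     // first action: reads from MySQL, fills the cache, then counts
    println(s"count $cachedCount")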

1 Comment
That doesn't seem like too much data to use Spark. So why Spark? Commented Jan 2, 2019 at 9:11

2 Answers


You are using a tool meant to handle big data to solve toy examples, and thus you are getting all of the overhead and none of the benefits.


2 Comments

I also tried with GBs of data but got the same effect, so please help me if there is any other way of coding the join query using SparkSession.
Data that fits in memory is still not a use case for Spark. In any event, getting the number of records in a query result, a figure the DB already has once it has performed the query, will take more time in Spark, which has to deserialize the response into its own data structures, spread the data across executors, and then count and aggregate the results (see the sketch below).
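To illustrate that point, a minimal sketch of letting the database compute the count itself instead of counting in Spark (the COUNT(*) wrapper and the aliases q and counttable are hypothetical, reusing the connection options from the question):

    val countDf = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://ip/dbname")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", s"(SELECT COUNT(*) AS cnt FROM ($query) q) as counttable")  // push the count down to MySQL
      .load()
    val pushedCount = countDf.first().getLong(0)  // single-row, single-column result returned by the database
    println(s"count $pushedCount")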

Try options like:

partitionColumn

numPartitions

lowerBound

upperBound

These options can help improve the performance of the query, as they create multiple partitions and the reads happen in parallel.
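A minimal sketch of such a partitioned JDBC read, built on the question's options (the partition column name id and the bounds are assumptions; use a numeric column that actually appears in your query result, together with its real min and max values):

    val partitionedDf = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://ip/dbname")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", s"($query) as temptable")
      .option("partitionColumn", "id")  // assumed numeric column in the query result
      .option("lowerBound", "1")        // assumed minimum value of that column
      .option("upperBound", "2354")     // assumed maximum value of that column
      .option("numPartitions", "4")     // Spark issues 4 parallel range queries against MySQL
      .load()

Note that for a result of only a few hundred rows the parallelism will not help much, but for larger tables it spreads the read across executors instead of pulling everything through a single JDBC connection.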

