
I have a query which joins 4 tables, and I used query pushdown to read it into a DataFrame.

val df = spark.read.format("jdbc"). option("url", "jdbc:mysql://ip/dbname"). option("driver", "com.mysql.jdbc.Driver"). option("user", "username"). option("password", "password") .option("dbtable",s"($query) as temptable") .load() 

The numbers of records in the individual tables are 430, 350, 64, and 2354 respectively. It takes 12.784 sec to load the data and 2.119 sec to create the SparkSession.

Then I count the result data as follows:

    val count = df.count()
    println(s"count $count")

The total execution time is then 25.806 sec, and the result contains only 430 records.

When I run the same query in SQL Workbench it takes only a few seconds to execute completely. I also tried cache() after load(), but it takes the same time. So how can I execute it much faster than this?
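For reference, the cache attempt mentioned above would look roughly like this (a minimal sketch; the cache is only materialized on the first action, so the initial count still pays the full JDBC read cost):

    val cachedDf = df.cache()              // marks the DataFrame for caching; nothing is read yet
    val cachedCount = cachedDf.count()     // first action: reads from MySQL, fills the cache, then counts
    println(s"count $cachedCount")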

1 Comment
That doesn't seem like too much data to use Spark. So why Spark? Commented Jan 2, 2019 at 9:11

2 Answers


You are using a tool meant to handle big data to solve toy examples, and thus you are getting all of the overhead and none of the benefits.


2 Comments

I also tried with GBs of data but got the same effect, so please help me if there is any other way of coding the join query using SparkSession.
Data that fits in memory is still not a use case for Spark. In any event, getting the number of records in a query result, a figure the DB already has once it has performed the query, will take more time in Spark, which has to deserialize the response into its own data structures, spread the data across executors, and then count and aggregate the results (see the sketch below).
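To illustrate that point, a minimal sketch of letting the database compute the count itself instead of counting in Spark (the COUNT(*) wrapper and the aliases q and counttable are hypothetical, reusing the connection options from the question):

    val countDf = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://ip/dbname")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", s"(SELECT COUNT(*) AS cnt FROM ($query) q) as counttable")  // push the count down to MySQL
      .load()
    val pushedCount = countDf.first().getLong(0)  // single-row, single-column result returned by the database
    println(s"count $pushedCount")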

Try options like:

partitionColumn

numPartitions

lowerBound

upperBound

These options can help improve the performance of the query, as they create multiple partitions and the reads happen in parallel.
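A minimal sketch of such a partitioned JDBC read, built on the question's options (the partition column name id and the bounds are assumptions; use a numeric column that actually appears in your query result, together with its real min and max values):

    val partitionedDf = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://ip/dbname")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", s"($query) as temptable")
      .option("partitionColumn", "id")  // assumed numeric column in the query result
      .option("lowerBound", "1")        // assumed minimum value of that column
      .option("upperBound", "2354")     // assumed maximum value of that column
      .option("numPartitions", "4")     // Spark issues 4 parallel range queries against MySQL
      .load()

Note that for a result of only a few hundred rows the parallelism will not help much, but for larger tables it spreads the read across executors instead of pulling everything through a single JDBC connection.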

