
I am reading in a table from our SQL Server into Databricks. The source SQL table contains ~50M rows. On read, I am selecting one client, like this:

```python
table1 = "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"
df = spark.read.options(user=user, password=pswd, columnEncryptionSetting='Enabled') \
    .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + '; database=' + database, table1)
```
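For what it's worth, because the parenthesized subquery is passed as the `dbtable` argument, SQL Server receives it as-is and the `Client` filter does run server-side. A small helper like this (hypothetical, not part of my actual code, just to make the subquery pattern explicit) builds the same string:

```python
def pushdown_subquery(columns, table, client_id, alias="A"):
    """Wrap a filtered SELECT in a parenthesized, aliased subquery,
    the form the JDBC `dbtable` option expects."""
    cols = ", ".join(columns)
    return f"(select {cols} from {table} where Client in ('{client_id}')) {alias}"

table1 = pushdown_subquery(["Client", "Revenue"], "dbo.sqltable", "0000000001")
# table1 == "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"
```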

When I call `df.show()` to display the result, I can see the table, but it takes over a minute to process 400 rows. My guess is that this happens because the source table is so large and `.show()` is working off the original 50M-row table even though I filter it upfront. If that is the case, how do I get processing down to around 0.1 seconds on the filtered df, which is what I would expect for a `.show()` on a 400-row df (and is what we get on our dev cluster for 20K rows)? I have tried creating a temp view of the filtered df to see if that makes a difference, but I don't know if I am on the right track:

```python
df.createOrReplaceTempView('dftmp')
spark.table("dftmp")
spark.table("dftmp").cache
spark.table("dftmp").count
```

All I get out of this is:

```
<bound method DataFrame.count of DataFrame[Client: string, Revenue: decimal(17,2)]>
```
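I suspect that output appears because `.cache` and `.count` are written without parentheses, so Python evaluates them to bound method objects instead of calling them. A minimal plain-Python illustration (no Spark needed):

```python
class Demo:
    def count(self):
        return 42

d = Demo()
print(d.count)    # the method object itself: <bound method Demo.count of ...>
print(d.count())  # actually calling it prints 42
```

With the parentheses added, `spark.table("dftmp").count()` would return a row count rather than the `<bound method ...>` repr.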

I got the above from here: how-does-createorreplacetempview-work-in-spark
