
I am reading in a table from our SQL Server into Databricks. The source SQL table contains ~50M rows. On read, I am selecting one client, like this:

```python
table1 = "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"
df = spark.read.options(user=user, password=pswd, columnEncryptionSetting='Enabled') \
    .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + '; database=' + database, table1)
```
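For what it's worth, because the parenthesized subquery is passed as the `dbtable` argument, SQL Server receives it as-is and the `Client` filter does run server-side. A small helper like this (hypothetical, not part of my actual code, just to make the subquery pattern explicit) builds the same string:

```python
def pushdown_subquery(columns, table, client_id, alias="A"):
    """Wrap a filtered SELECT in a parenthesized, aliased subquery,
    the form the JDBC `dbtable` option expects."""
    cols = ", ".join(columns)
    return f"(select {cols} from {table} where Client in ('{client_id}')) {alias}"

table1 = pushdown_subquery(["Client", "Revenue"], "dbo.sqltable", "0000000001")
# table1 == "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"
```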

When I call `df.show()` to display the result, I can see the table, but it takes over a minute to process 400 rows. My guess is that this happens because the source table is so large and `.show()` is working off the original 50M-row table even though I filter it upfront. If that is the case, how do I get processing down to around 0.1 seconds on the filtered df, which is what I would expect for a `.show()` on a 400-row df (and is what we get on our dev cluster for 20K rows)? I have tried creating a temp view of the filtered df to see if that makes a difference, but I don't know if I am on the right track:

```python
df.createOrReplaceTempView('dftmp')
spark.table("dftmp")
spark.table("dftmp").cache
spark.table("dftmp").count
```

All I get out of this is:

```
<bound method DataFrame.count of DataFrame[Client: string, Revenue: decimal(17,2)]>
```
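I suspect that output appears because `.cache` and `.count` are written without parentheses, so Python evaluates them to bound method objects instead of calling them. A minimal plain-Python illustration (no Spark needed):

```python
class Demo:
    def count(self):
        return 42

d = Demo()
print(d.count)    # the method object itself: <bound method Demo.count of ...>
print(d.count())  # actually calling it prints 42
```

With the parentheses added, `spark.table("dftmp").count()` would return a row count rather than the `<bound method ...>` repr.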

I got the above from here: how-does-createorreplacetempview-work-in-spark
