
Is there any difference between the two approaches to caching below when using Spark SQL, and is there any performance benefit of one over the other? (Building the DataFrame is costly and I want to reuse it across many actions.)

1> Cache the original DataFrame before registering it as a temporary table

df.cache()

df.createOrReplaceTempView("dummy_table")

2> Register the DataFrame as a temporary table and cache the table

df.createOrReplaceTempView("dummy_table")

sqlContext.cacheTable("dummy_table")

Thanks in advance.

1 Answer


df.cache() is a lazy cache, which means the cache is only populated when the next action is triggered.
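For illustration, a minimal sketch of the lazy behaviour in Scala (the SparkSession setup and toy DataFrame here are illustrative assumptions, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
val df = spark.range(1000000L).toDF("id") // toy stand-in for your costly DataFrame

df.cache() // only marks df for caching; nothing is materialized yet
df.count() // first action: this scan populates the in-memory cache
df.count() // later actions read from the cache instead of recomputing df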

sqlContext.cacheTable("dummy_table") is an eager cache, which means the table gets cached as soon as the command is called. An equivalent of this would be: spark.sql("CACHE TABLE dummy_table")
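A matching sketch of the eager form, reusing the assumed spark and df from above:

df.createOrReplaceTempView("dummy_table")
spark.sql("CACHE TABLE dummy_table") // eager: the table is scanned and cached as soon as this runs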

To answer your question about whether there is a performance benefit of one over the other: it is hard to tell without understanding your entire workflow and how (and where) your cached DataFrames are used. I'd recommend the eager cache, so you won't have to second-guess when (and whether) your DataFrame is cached.
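If you'd rather not guess at all, the catalog also lets you check and release the cache explicitly (a sketch using the dummy_table view registered above):

spark.catalog.isCached("dummy_table") // true once the table is cached
spark.catalog.uncacheTable("dummy_table") // drop it from memory when you're done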


1 Comment

Thanks @Arjoon for your explanation.
