
I have a Spark application with a very large DataFrame. I am currently registering the DataFrame as a temp table so I can perform several queries against it.

When I am using RDDs I call persist(StorageLevel.MEMORY_AND_DISK()); what is the equivalent for a temp table?

Below are two possibilities. I don't think Option 2 will work, because cacheTable tries to cache in memory and my table is too big to fit.

    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);

    // Option 1
    // standardLocationRecords.persist(StorageLevel.MEMORY_AND_DISK());
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // Option 2
    // standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

How can I best cache my temp table so I can perform several queries against it without having to keep reloading the data?

Thanks, Nathan

1 Answer


I've just had a look at the Spark 1.6.1 source code, and Option 2 is actually what you want. Here's an excerpt from a comment on caching:

... Unlike RDD.cache(), the default storage level is set to be MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive.

    def cacheTable(tableName: String): Unit = {
      cacheManager.cacheQuery(table(tableName), Some(tableName))
    }

    private[sql] def cacheQuery(
        query: Queryable,
        tableName: Option[String] = None,
        storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

References:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L355

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L76
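
Applied to your Java code, Option 2 would look roughly like the sketch below. It's a minimal sketch assuming Spark 1.6, your existing inputReader, sc and sqlc, and a hypothetical name column used only to show a follow-up query; cacheTable gives you the MEMORY_AND_DISK behaviour without calling persist yourself.

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Read the data once and expose it as a temp table (Option 2).
    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // cacheTable defaults to MEMORY_AND_DISK, so partitions that don't fit
    // in memory spill to disk instead of forcing a re-read of the source.
    sqlc.cacheTable("standardlocationrecords");

    // Subsequent queries hit the cached columnar data, not the original input.
    // ("name" is a hypothetical column, just for illustration.)
    DataFrame counts = sqlc.sql(
        "SELECT name, COUNT(*) AS cnt FROM standardlocationrecords GROUP BY name");
    counts.show();

    // Release the cache when you are finished with the table.
    sqlc.uncacheTable("standardlocationrecords");

As with RDD caching, the programmatic cacheTable call is lazy: the cache is populated the first time an action materializes the table, and queries after that reuse it.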
