
I have a Spark application with a very large DataFrame. I am currently registering the DataFrame as a temp table so I can perform several queries against it.

When I am using RDDs I call persist(StorageLevel.MEMORY_AND_DISK()); what is the equivalent for a temp table?

Below are two possibilities. I don't think Option 2 will work, because cacheTable tries to cache in memory and my table is too big to fit.

    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);

    // Option 1
    // standardLocationRecords.persist(StorageLevel.MEMORY_AND_DISK());
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // Option 2
    // standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

How can I best cache my temp table so I can perform several queries against it without having to keep reloading the data?

Thanks, Nathan

1 Answer


I've just had a look at the Spark 1.6.1 source code, and Option 2 is actually what you want. Here's an excerpt from a comment on caching:

... Unlike RDD.cache(), the default storage level is set to be MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive.

    def cacheTable(tableName: String): Unit = {
      cacheManager.cacheQuery(table(tableName), Some(tableName))
    }

    private[sql] def cacheQuery(
        query: Queryable,
        tableName: Option[String] = None,
        storageLevel: StorageLevel = MEMORY_AND_DISK): Unit

References:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L355

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L76
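
Applied to your Java code, Option 2 would look roughly like the sketch below. It's a minimal sketch assuming Spark 1.6, your existing inputReader, sc and sqlc, and a hypothetical name column used only to show a follow-up query; cacheTable gives you the MEMORY_AND_DISK behaviour without calling persist yourself.

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Read the data once and expose it as a temp table (Option 2).
    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // cacheTable defaults to MEMORY_AND_DISK, so partitions that don't fit
    // in memory spill to disk instead of forcing a re-read of the source.
    sqlc.cacheTable("standardlocationrecords");

    // Subsequent queries hit the cached columnar data, not the original input.
    // ("name" is a hypothetical column, just for illustration.)
    DataFrame counts = sqlc.sql(
        "SELECT name, COUNT(*) AS cnt FROM standardlocationrecords GROUP BY name");
    counts.show();

    // Release the cache when you are finished with the table.
    sqlc.uncacheTable("standardlocationrecords");

As with RDD caching, the programmatic cacheTable call is lazy: the cache is populated the first time an action materializes the table, and queries after that reuse it.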
