
I need to do streaming reads of very large files (TBs in size). To achieve higher throughput, I would like to cache parts of the files in memory. Spark can cache data in distributed memory. How can I use Spark to cache file parts?

The files are bigger than the local storage of any single machine and bigger than the total memory capacity of the cluster.

1 Answer

  1. Store the data in a distributed storage system such as HDFS. This stores your data in a distributed manner. Choose the file system that fits your requirements (on-premises, in the cloud, etc.).

  2. Run Spark on the data in HDFS. Create an RDD from the file (see the Spark documentation), filter out the part of the data you actually need (for example, only the lines containing "error" in a large log file), and cache that part in memory so that subsequent queries are faster; a sketch follows this list.
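
Below is a minimal Scala sketch of step 2. The HDFS path (hdfs://namenode:8020/logs/app.log) and the "error" filter string are hypothetical placeholders; adjust them to your data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CacheErrorLines {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-error-lines")
          .getOrCreate()

        // Read the large file as an RDD of lines; Spark splits it into
        // partitions distributed across the cluster, so no single node
        // has to hold the whole file.
        val lines = spark.sparkContext.textFile("hdfs://namenode:8020/logs/app.log")

        // Keep only the part of the data that is actually needed.
        val errors = lines.filter(_.contains("error"))

        // Cache the filtered subset; MEMORY_ONLY_SER stores it serialized,
        // which is more compact than deserialized objects.
        errors.persist(StorageLevel.MEMORY_ONLY_SER)

        // The first action materializes the cache; later actions reuse it.
        println(s"error lines: ${errors.count()}")
        println(s"fatal among them: ${errors.filter(_.contains("fatal")).count()}")

        spark.stop()
      }
    }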

There are a number of caching-related parameters you can tune to help fit your data in memory (keeping data serialized with Kryo serialization, etc.). See the Memory Tuning guide for details.
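
As a sketch of those settings (the values shown are illustrative, not recommendations), the Kryo serializer and a couple of memory-related options can be set when building the session:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuned-cache")
      // Kryo is faster and more compact than Java serialization for cached data.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fraction of the heap shared by execution and storage (0.6 is the default).
      .config("spark.memory.fraction", "0.6")
      // Compress serialized cached partitions to fit more data in memory.
      .config("spark.rdd.compress", "true")
      .getOrCreate()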

You can also consider breaking the data into parts (separate files, partitioned tables, etc.) and loading only part of it into Spark.
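
For example, if the data were split into date-named files (a hypothetical layout such as logs/2024-01-01.log, logs/2024-01-02.log, ...), a glob pattern lets Spark read and cache only the files you need:

    import org.apache.spark.storage.StorageLevel

    // Reuses the SparkSession from the earlier sketch; only the files
    // matching the glob are read and cached.
    val january = spark.sparkContext.textFile("hdfs://namenode:8020/logs/2024-01-*.log")
    january.persist(StorageLevel.MEMORY_ONLY_SER)
    println(january.count())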


1 Comment

Thanks for this, but if I write a Spark job, the cached data only lives until the driver dies. For example, if I run a query with Spark SQL, when the job is done the cache is gone too. How can I keep a Spark job always up?

