
I need to do streaming reads of very large files (TBs in size). To achieve higher throughput, I would like to cache parts of the files in memory. Spark can cache data in distributed memory. How can I use Spark to cache file parts?

The files are bigger than the local storage of any single machine and bigger than the total memory capacity of the cluster.

1 Answer

  1. Store the data in a distributed storage system such as HDFS. This stores your data in a distributed manner. Choose the file system that fits your requirements (on-premises, in the cloud, etc.).

  2. Run Spark on the data in HDFS. Create an RDD from the file (see the Spark documentation), filter out the part of the data you actually need (for example, only the lines containing "error" in a large log file), and cache that part in memory so that subsequent queries are faster; a sketch follows this list.
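
Below is a minimal Scala sketch of step 2. The HDFS path (hdfs://namenode:8020/logs/app.log) and the "error" filter string are hypothetical placeholders; adjust them to your data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CacheErrorLines {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-error-lines")
          .getOrCreate()

        // Read the large file as an RDD of lines; Spark splits it into
        // partitions distributed across the cluster, so no single node
        // has to hold the whole file.
        val lines = spark.sparkContext.textFile("hdfs://namenode:8020/logs/app.log")

        // Keep only the part of the data that is actually needed.
        val errors = lines.filter(_.contains("error"))

        // Cache the filtered subset; MEMORY_ONLY_SER stores it serialized,
        // which is more compact than deserialized objects.
        errors.persist(StorageLevel.MEMORY_ONLY_SER)

        // The first action materializes the cache; later actions reuse it.
        println(s"error lines: ${errors.count()}")
        println(s"fatal among them: ${errors.filter(_.contains("fatal")).count()}")

        spark.stop()
      }
    }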

There are a number of caching-related parameters you can tune to help fit your data in memory (keeping data serialized with Kryo serialization, etc.). See the Memory Tuning guide for details.
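
As a sketch of those settings (the values shown are illustrative, not recommendations), the Kryo serializer and a couple of memory-related options can be set when building the session:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuned-cache")
      // Kryo is faster and more compact than Java serialization for cached data.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fraction of the heap shared by execution and storage (0.6 is the default).
      .config("spark.memory.fraction", "0.6")
      // Compress serialized cached partitions to fit more data in memory.
      .config("spark.rdd.compress", "true")
      .getOrCreate()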

You can also consider breaking the data into parts (separate files, partitioned tables, etc.) and loading only part of it into Spark.
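
For example, if the data were split into date-named files (a hypothetical layout such as logs/2024-01-01.log, logs/2024-01-02.log, ...), a glob pattern lets Spark read and cache only the files you need:

    import org.apache.spark.storage.StorageLevel

    // Reuses the SparkSession from the earlier sketch; only the files
    // matching the glob are read and cached.
    val january = spark.sparkContext.textFile("hdfs://namenode:8020/logs/2024-01-*.log")
    january.persist(StorageLevel.MEMORY_ONLY_SER)
    println(january.count())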


1 Comment

Thanks for this, but if I write a Spark job, the cached data only lives until the driver dies. For example, if I run a query with Spark SQL, when the job is done the cache is gone too. How can I keep a Spark job always up?

