
I have a very large DataFrame in Spark, and it takes too long to do operations on it.

It has 10M rows.

I want to sample it so I can test things more quickly, so I am trying:

    val redux = df.limit(1000)
    redux.cache

I thought this would persist a DataFrame with only 1K rows.

But running redux.count, for example, still takes too long (3 minutes).

I am running this on an 8-worker cluster with 6 GB RAM (on Databricks).

Am I doing something wrong?

Thanks!

2 Comments
  • Please run redux.count again and see whether it's faster; it should be ;) Caching is lazy, so it happens while the first action runs. Commented Oct 19, 2016 at 20:37
  • YES! Thanks. Other actions ran pretty fast after the first one. =D Commented Oct 19, 2016 at 20:46

1 Answer


The answer is:

Caching is performed lazily, so even though the first "count" action will take some time, subsequent operations will be faster.
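For illustration, a minimal sketch (assuming df is the large DataFrame from the question):

    // Take a small slice and mark it for caching; nothing runs yet (both are lazy).
    val redux = df.limit(1000)
    redux.cache()

    // The first action materializes the cache, so it still scans the source data.
    redux.count()  // slow the first time

    // Later actions read from the cached 1K-row DataFrame and return quickly.
    redux.count()  // fast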

Credits to T. Gaweda


1 Comment

It's often used in ML algorithms :) The input data is cached, and then a simple count() is run to trigger the caching. Then, when the iterative part of the algorithm runs, it works on already-cached data and is a lot faster :)
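A hedged sketch of that pattern (rawData and the loop body are placeholders, not from any specific library):

    // `rawData` is a stand-in for whatever DataFrame feeds the algorithm.
    val trainingData = rawData.cache()

    // A cheap action forces Spark to actually populate the cache up front.
    trainingData.count()

    // The iterative part now reads from memory instead of recomputing the input.
    for (i <- 1 to 10) {
      val n = trainingData.count()  // stand-in for one real iteration's work
      println(s"iteration $i saw $n cached rows")
    }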
