
I have a very large DataFrame in Spark, and it takes too long to do operations on it.

It has 10M rows.

I want to sample it so I can test things more quickly, so I am trying:

    val redux = df.limit(1000)
    redux.cache

I thought this would persist a DataFrame with only 1K rows.

But running redux.count, for example, still takes too long (3 minutes).

I am running this on an 8-worker cluster with 6 GB RAM (on Databricks).

Am I doing something wrong?

Thanks!

2 Comments
  • Please run redux.count again and see whether it's faster; it should be ;) Caching is lazy, so it happens while the first action runs. Commented Oct 19, 2016 at 20:37
  • YES! Thanks. Other actions ran pretty fast after the first one. =D Commented Oct 19, 2016 at 20:46

1 Answer


The answer is:

Caching is performed lazily, so even though the first "count" action will take some time, subsequent operations will be faster.
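For illustration, a minimal sketch (assuming df is the large DataFrame from the question):

    // Take a small slice and mark it for caching; nothing runs yet (both are lazy).
    val redux = df.limit(1000)
    redux.cache()

    // The first action materializes the cache, so it still scans the source data.
    redux.count()  // slow the first time

    // Later actions read from the cached 1K-row DataFrame and return quickly.
    redux.count()  // fast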

Credits to T. Gaweda


1 Comment

It's often used in ML algorithms :) The input data is cached, and then a simple count() is run to trigger the caching. Then, when the iterative part of the algorithm runs, it works on already-cached data and is a lot faster :)
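A hedged sketch of that pattern (rawData and the loop body are placeholders, not from any specific library):

    // `rawData` is a stand-in for whatever DataFrame feeds the algorithm.
    val trainingData = rawData.cache()

    // A cheap action forces Spark to actually populate the cache up front.
    trainingData.count()

    // The iterative part now reads from memory instead of recomputing the input.
    for (i <- 1 to 10) {
      val n = trainingData.count()  // stand-in for one real iteration's work
      println(s"iteration $i saw $n cached rows")
    }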
