I have a very large DataFrame in Spark, and it takes too long to do operations on it.
It has 10M rows.
I want to sample it so I can test things more quickly, so I am trying:
```scala
val redux = df.limit(1000)
redux.cache()
```

I thought this would persist a DataFrame with only 1K rows.
But running `redux.count`, for example, still takes too long (3 minutes).
I am running this on an 8-worker cluster with 6 GB RAM (on Databricks).
Am I doing something wrong?
Thanks!
Run `redux.count` again and see whether it's faster; it should be ;) Caching is lazy: it is only performed while executing the first action on the DataFrame.
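A minimal sketch of that behavior (the `SparkSession` setup and the `spark.range` stand-in for your real 10M-row DataFrame are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup; adjust to your environment.
val spark = SparkSession.builder().appName("cache-demo").getOrCreate()

// Stand-in for your real 10M-row DataFrame.
val df = spark.range(10000000L).toDF("id")

val redux = df.limit(1000)
redux.cache() // lazy: only *marks* redux for caching, computes nothing yet

redux.count() // first action: runs the job and populates the cache
redux.count() // subsequent actions read the cached 1K rows and return quickly
```

Note that the first `count` still pays the cost of producing the 1,000 rows from the large DataFrame; only the actions after it benefit from the cache.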