I have the following code:

    df = sql_context.sql("select * from table").cache()
    first_df = df.where(df.id > 10)
    second_df = df.where(df.city == 'NY')
    third_df = df.where(df.x == 5)
    unioned_df = first_df.union(second_df).union(third_df)
    unioned_df.write.format('csv').save(path)

My code has only one action (the write to CSV). Is there any point in caching df?
Please ignore the fact that these filters could be combined into one.
I wrote it this way on purpose, to understand how the cache mechanism works in the background.