I have the following strategy to transform a DataFrame `df`.
```python
df = T1(df)
df.cache()
df = T2(df)
df.cache()
...
df = Tn(df)
df.cache()
```

Here T1, T2, ..., Tn are n transformations that each return a Spark DataFrame. I cache after every step because `df` has to pass through a lot of transformations and is used multiple times along the way; without caching, lazy evaluation of the transformations could make those intermediate uses very slow. What I am worried about is that the n DataFrames cached one by one will gradually consume the RAM. I have read that Spark automatically un-caches "least recently used" items.
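For concreteness, here is a minimal runnable sketch of that pattern. The transformations, column names, and toy data are purely hypothetical placeholders for my real T1 ... Tn:

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chained-cache-example").getOrCreate()

def T1(df: DataFrame) -> DataFrame:
    # hypothetical placeholder transformation
    return df.withColumn("doubled", F.col("value") * 2)

def T2(df: DataFrame) -> DataFrame:
    # hypothetical placeholder transformation
    return df.filter(F.col("doubled") > 0)

df = spark.createDataFrame([(1,), (-2,), (3,)], ["value"])

df = T1(df)
df.cache()      # cache() is lazy; data is only materialized on the first action
df.count()      # e.g. a count, or whatever intermediate use the pipeline needs

df = T2(df)     # the DataFrame previously bound to df loses its Python reference here,
df.cache()      # but its cached blocks stay in Spark storage until evicted or unpersisted
df.count()
```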
- How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
- Does a Spark DataFrame with no reference or pending evaluation attached to it become eligible for garbage collection as well? Or is a Spark DataFrame never garbage collected?
- Based on the answers to the above two queries, is the strategy described above correct?
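If it turns out that I cannot rely on automatic eviction, my fallback would be to release each intermediate explicitly. A rough sketch of that variant, continuing the hypothetical T1/T2 placeholders and Spark session from the sketch above (`unpersist()` and `is_cached` are standard `pyspark.sql.DataFrame` members; the loop and print are just illustrative):

```python
previous = None
for transform in [T1, T2]:      # stand-ins for T1 ... Tn
    df = transform(df)
    df.cache()
    df.count()                  # force materialization so the next step reads from cache
    if previous is not None:
        previous.unpersist()    # explicitly drop the earlier stage's cached blocks
    previous = df

print(df.is_cached)             # True: only the last stage remains cached
```

I would prefer not to do this bookkeeping by hand if Spark's own eviction already handles it safely, which is why I am asking the questions above.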