I have a function that joins a list of dataframes to a base dataframe and returns the result. I am trying to reduce the time this operation takes. Since I was joining to the base dataframe multiple times, I cached it, but the runtime is still about the same. This is the function I am using:
```python
from pyspark import StorageLevel


def merge_dataframes(base_df, df_list, id_col):
    """
    Joins multiple dataframes using an identifier variable common across datasets

    :param base_df: everything will be added to this dataframe
    :param df_list: dfs that have to be joined to main dataset
    :param id_col: the identifier column
    :return: dataset with all joins
    """
    base_df.persist(StorageLevel.MEMORY_AND_DISK)
    for each_df in df_list:
        base_df = base_df.join(each_df, id_col)
    base_df.unpersist()
    return base_df
```

I was surprised to see similar results after caching. What's the reason behind this, and what can I do to make this run faster?
Also, since the datasets I am currently using are relatively small (~50k records), I don't mind caching datasets as and when needed, as long as I unpersist them afterwards.