I have about 10,000 different Spark DataFrames that need to be merged using union, but the union takes a very long time.
Below is a brief sample of the code I ran; `dfs` is the collection of DataFrames I'd like to union:
    from functools import reduce
    from pyspark.sql import DataFrame

    dfOut = reduce(DataFrame.unionAll, dfs)

It seems that when I union 100-200 DataFrames it is quite fast, but the running time increases exponentially as I increase the number of DataFrames to merge.
Any suggestions on improving the efficiency? Thanks a lot!
I haven't hit this with `union` specifically, but I've experienced similar issues with long sequences of other transformations (usually dozens of sequential `withColumn` transformations). The solution in my case was to insert `cache` calls after every N transformations. The reason was garbage collection on the driver node: as the number of transformations grows, building and optimizing the execution plan takes more and more memory (possibly exponentially more). You may try splitting `dfs` into chunks of 100-200 DataFrames, unioning everything within each chunk, calling `cache` or `checkpoint` on the chunk result, and then unioning all the chunks. I hope it helps.
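A minimal sketch of that chunked approach (assuming `spark` is an active SparkSession, `dfs` is your list of DataFrames, and the chunk size and checkpoint directory are illustrative choices, not tuned values):

    from functools import reduce
    from pyspark.sql import DataFrame

    # checkpoint() requires a checkpoint directory; this path is just an example
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    def union_in_chunks(dfs, chunk_size=100):
        """Union DataFrames chunk by chunk, cutting the lineage between chunks."""
        chunk_results = []
        for i in range(0, len(dfs), chunk_size):
            # union one chunk of DataFrames
            chunk_df = reduce(DataFrame.unionAll, dfs[i:i + chunk_size])
            # checkpoint() (eager by default) materializes the chunk and truncates
            # its lineage, so the plan of the final union stays small
            chunk_results.append(chunk_df.checkpoint())
        # union the chunk results, each of which now has a short lineage
        return reduce(DataFrame.unionAll, chunk_results)

    dfOut = union_in_chunks(dfs)

If you prefer `cache` over `checkpoint`, note that `cache` is lazy: trigger an action on each chunk (e.g. `count()`) so it is actually materialized, and keep in mind it does not truncate the lineage the way `checkpoint` does.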