
I have about 10,000 different Spark DataFrames that need to be merged with union, but the union takes a very long time.

Below is a brief sample of the code I ran; dfs is a collection of the DataFrames that I'd like to union:

from functools import reduce
from pyspark.sql import DataFrame

dfOut = reduce(DataFrame.unionAll, dfs)

It seems that unioning 100-200 DataFrames is quite fast, but the running time grows exponentially as I increase the number of DataFrames to merge.

Any suggestions on improving the efficiency? Thanks a lot!

I am not sure about union, but I've experienced issues with long sequences of other transformations (usually dozens of sequential withColumn calls). The solution in my case was to insert cache calls every N transformations. The reason was garbage collection on the driver node: as the number of transformations grows, it takes more memory (maybe exponentially more) to build and optimize the execution plan. You may try splitting dfs into chunks of 100-200 DataFrames, unioning everything within each chunk, calling cache or checkpoint on it, and then unioning all the chunks. I hope it helps. Commented Aug 20, 2019 at 10:04
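The chunking idea above can be sketched as a small helper. The function name, the chunk_size default, and the truncate hook are assumptions for illustration, not part of the original suggestion:

```python
from functools import reduce

def union_in_chunks(dfs, union_op, chunk_size=100, truncate=None):
    """Union a long list of DataFrames in fixed-size chunks.

    union_op -- binary union function, e.g. DataFrame.unionAll
    truncate -- optional per-chunk hook, e.g. lambda df: df.checkpoint(),
                used to cut each chunk's lineage so the optimizer never
                has to process one plan with thousands of Union nodes
    """
    chunks = []
    for i in range(0, len(dfs), chunk_size):
        # Union one chunk of DataFrames with the binary union function
        chunk = reduce(union_op, dfs[i:i + chunk_size])
        chunks.append(truncate(chunk) if truncate else chunk)
    # Finally union the (much shorter) list of chunk results
    return reduce(union_op, chunks)
```

In PySpark this might be called as `union_in_chunks(dfs, DataFrame.unionAll, truncate=lambda df: df.checkpoint())`, after setting a checkpoint directory with `spark.sparkContext.setCheckpointDir(...)`; checkpoint materializes each chunk and discards its lineage.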

1 Answer


The details of this issue are available at https://issues.apache.org/jira/browse/SPARK-12616:

Union logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer (or analyzer?) to collapse all adjacent Unions into one.

Note that this problem doesn't exist in the physical plan, because the physical Union already supports an arbitrary number of children.

This was fixed in version 2.0.0. If you have to use a version lower than 2.0.0, union the data using the RDD union function instead.
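For the RDD route, a hedged sketch of what that might look like (the helper name is an assumption, and it assumes every DataFrame in dfs shares the first one's schema). The key point is that SparkContext.union is n-ary, so it merges all inputs in one node instead of building a chain of thousands of binary Union nodes:

```python
def union_via_rdds(spark, dfs):
    """Merge many DataFrames through the n-ary SparkContext.union,
    bypassing the deep binary Union logical plan.

    spark -- an active SparkSession
    dfs   -- list of DataFrames, all sharing the schema of dfs[0]
    """
    schema = dfs[0].schema
    # One n-ary union over all underlying RDDs, no per-pair plan nodes
    merged = spark.sparkContext.union([df.rdd for df in dfs])
    # Rebuild a DataFrame from the merged RDD with the shared schema
    return spark.createDataFrame(merged, schema)
```

Note the round trip through RDDs loses any optimizations attached to the original DataFrames, so on 2.0.0+ the plain DataFrame union is preferable.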


1 Comment

Thanks for the information! What modifications could I make to optimize performance in this case? I'm using Spark 2.4.
