Using PySpark with Delta Lake on Databricks, I have the following scenario:
```python
sdf = spark.read.format("delta").table("...")
result = sdf.filter(...).groupBy(...).agg(...)

analysis_1 = result.groupBy(...).count()  # transformation performed here
analysis_2 = result.groupBy(...).count()  # transformation performed here
```

As I understand Spark with Delta Lake, due to lazy evaluation, `result` is not actually computed when it is declared, but only when an action is run against it.
However, in this example it is used multiple times, so the expensive transformation is recomputed for each action.
Is it possible to force execution at some point in the code, e.g.:

```python
sdf = spark.read.format("delta").table("...")
result = sdf.filter(...).groupBy(...).agg(...)
result.force()  # transformation performed here??

analysis_1 = result.groupBy(...).count()  # quick smaller transformation??
analysis_2 = result.groupBy(...).count()  # quick smaller transformation??
```
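For what it's worth, the closest thing I have found so far is `DataFrame.cache()` (or `persist()`) followed by an action to trigger materialization, but I am not sure whether that is the idiomatic way to do this. A minimal, runnable sketch of what I mean (the in-memory DataFrame and the column names `key`/`value` are stand-ins for my actual Delta table):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the Delta table read; a small in-memory frame keeps the
# sketch runnable (the data and column names are made up for illustration)
sdf = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)], ["key", "value"]
)

# The expensive transformation, still declared lazily
result = (
    sdf.filter(F.col("value") > 0)
    .groupBy("key")
    .agg(F.sum("value").alias("total"))
)

# cache() only *marks* the DataFrame for storage; the action below
# (count) is what actually materializes and stores it
result.cache()
result.count()

# Subsequent actions should now read from the cache instead of
# recomputing the filter/groupBy/agg
analysis_1 = result.groupBy("key").count()
analysis_2 = result.groupBy("total").count()
```

Is this the right approach, or is there a better way to force execution at a given point?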