When to cache a DataFrame?

Question

My question is - when should I do dataframe.cache() and when it's useful?

Also, in my code should I cache the dataframes in the commented lines?

Note: My dataframes are loaded from a Redshift DB.

Many thanks

Here my code:

def sub_tax_transfer_pricing_eur_aux(manager, dataframe, seq_recs, seq_reservas, df_aux): df_vta = manager.get_dataframe(tables['dwc_oth_v_re_v_impuesto_sap_vta']) df_cpa = manager.get_dataframe(tables['dwc_oth_v_re_v_impuesto_sap_cpa']) dataframe = dataframe.filter(dataframe.seq_rec.isin(seq_recs)) \ .filter(dataframe.seq_reserva.isin(seq_reservas)) ################################################## #SHOULD I CACHE HERE df_vta, df_cpa and dataframe ################################################## dataframe = dataframe.join(df_vta, [dataframe.ind_tipo_imp_vta_fac == df_vta.ind_tipo_imp_vta, dataframe.cod_impuesto_vta_fac == df_vta.cod_impuesto_vta, dataframe.cod_clasif_vta_fac == df_vta.cod_clasif_vta, dataframe.cod_esquema_vta_fac == df_vta.cod_esquema_vta, dataframe.cod_empresa_vta_fac == df_vta.cod_emp_atlas_vta, ]).drop("ind_tipo_imp_vta", "cod_impuesto_vta", "cod_clasif_vta", "cod_esquema_vta", "cod_emp_atlas_vta") \ .join(df_cpa, [dataframe.ind_tipo_imp_vta_fac == df_cpa.ind_tipo_imp_cpa, dataframe.cod_impuesto_vta_fac == df_cpa.cod_impuesto_cpa, dataframe.cod_clasif_vta_fac == df_cpa.cod_clasif_cpa, dataframe.cod_esquema_vta_fac == df_cpa.cod_esquema_cpa, dataframe.cod_empresa_vta_fac == df_cpa.cod_emp_atlas_cpa, ]).drop("ind_tipo_imp_cpa", "cod_impuesto_cpa", "cod_clasif_cpa", "cod_esquema_cpa", "cod_emp_atlas_cpa") \ .select("seq_rec", "seq_reserva", "ind_tipo_regimen_fac", "imp_margen_canal", "ind_tipo_regimen_con", "imp_coste", "imp_margen_canco", "imp_venta", "pct_impuesto_vta", "pct_impuesto_cpa") ###################################### #SHOULD I CACHE HERE dataframe AGAIN ? ###################################### dataframe = dataframe.withColumn("amount1", func.when(dataframe.ind_tipo_regimen_fac == 'E', dataframe.imp_margen_canal * ( 1 - (1 / (1 + (dataframe.pct_impuesto_vta / 100))))) .otherwise(dataframe.imp_venta * ( 1 - (1 / (1 + (dataframe.pct_impuesto_vta / 100)))) - ( dataframe.imp_venta - dataframe.imp_margen_canal) * ( 1 - (1 / (1 + (dataframe.pct_impuesto_cpa / 100)))))) dataframe = dataframe.withColumn("amount2", func.when(dataframe.ind_tipo_regimen_con == 'E', dataframe.imp_margen_canco * ( 1 - (1 / (1 + (dataframe.pct_impuesto_vta / 100))))) .otherwise((dataframe.imp_coste + dataframe.imp_margen_canco) * ( 1 - (1 / (1 + (dataframe.pct_impuesto_vta / 100)))) - ( dataframe.imp_coste) * ( 1 - (1 / (1 + (dataframe.pct_impuesto_cpa / 100)))))) dataframe = dataframe.na.fill({'amount1': 0}) dataframe = dataframe.na.fill({'amount2': 0}) dataframe = dataframe.join(df_aux, [dataframe.seq_rec == df_aux.operative_incoming, dataframe.seq_reserva == df_aux.booking_id]) dataframe = dataframe.withColumn("impuesto_canco1", udf_currency_exchange(dataframe.booking_currency, func.lit(EUR), dataframe.creation_date, dataframe.amount1)) dataframe = dataframe.withColumn("impuesto_canco2", udf_currency_exchange(dataframe.booking_currency, func.lit(EUR), dataframe.creation_date, dataframe.amount2)) dataframe = dataframe.withColumn("impuesto_canco", dataframe.impuesto_canco1 + dataframe.impuesto_canco2) dataframe = dataframe.na.fill({'impuesto_canco': 0}) dataframe = dataframe.select("operative_incoming", "booking_id", "impuesto_canco") ###################################### #SHOULD I CACHE HERE dataframe AGAIN ? ###################################### dataframe = dataframe.groupBy("operative_incoming", "booking_id").agg({'impuesto_canco': 'sum'}). \ withColumnRenamed("SUM(impuesto_canco)", "impuesto_canco") return dataframe

As you are not calling multiple actions, I don't think you should cache these dataframes. — ravi malhotra
– ravi malhotra, Commented Sep 29, 2019 at 8:59

Jacek Laskowski · Accepted Answer · 2017-05-24 11:41:31Z

when should I do dataframe.cache() and when it's usefull?

cache what you are going to use across queries (and early and often up to available memory). It does not really matter what programming language you use (Python or Scala or Java or SQL or R) as the underlying mechanics is the same.

You can see if a DataFrame was cached in your physical plan using explain operator (where InMemoryRelation entities reflect cached datasets with their storage level):

== Physical Plan == *Project [id#0L, id#0L AS newId#16L] +- InMemoryTableScan [id#0L] +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- *Range (0, 1, step=1, splits=Some(8))

After you cache (or persist) your DataFrame the first query may get slower, but it is going to pay off for the following queries.

You can check whether a Dataset was cached or not using the following code:

scala> :type q2 org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] val cache = spark.sharedState.cacheManager scala> cache.lookupCachedData(q2.queryExecution.logical).isDefined res0: Boolean = false

Also, in my code should I cache the dataframes in the commented lines?

Yes and no. Cache what represents external datasets so you don't pay the extra price of transmitting data across network (while accessing the external storage) every time you query over them.

Don't cache what you use only once or is easy to compute. Otherwise, cache.

Be careful what you cache, i.e. what Dataset is cached, as it gives different queries cached.

// cache after range(5) val q1 = spark.range(5).cache.filter($"id" % 2 === 0).select("id") scala> q1.explain == Physical Plan == *Filter ((id#0L % 2) = 0) +- InMemoryTableScan [id#0L], [((id#0L % 2) = 0)] +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- *Range (0, 5, step=1, splits=8) // cache at the end val q2 = spark.range(1).filter($"id" % 2 === 0).select("id").cache scala> q2.explain == Physical Plan == InMemoryTableScan [id#17L] +- InMemoryRelation [id#17L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- *Filter ((id#17L % 2) = 0) +- *Range (0, 1, step=1, splits=8)

There's one surprise with caching in Spark SQL. Caching is lazy and that's why you pay the extra price to have rows cached the very first action, but that only happens with DataFrame API. In SQL, caching is eager which makes a huge difference in query performance as you don't have you call an action to trigger caching.

Even if we are reusing a dataframe/rdd multiple times within a same spark job, does it makes sense to cache? I understand that if the computation what is been done is quite complex then we can cache dataframe to avoid recomputation in case an executor dies even though that particular dataframe i been used just once. But in other cases does it makes sense to cahce a dataframe which is used multiple times within a same job?
@ravimalhotra Cache a dataset unless you know it's a waste of time :) In other words, always cache a dataframe that is used multiple time within the same job.

Vaidas Armonas · Accepted Answer · 2018-01-04 15:43:32Z

Actually in your case .cache() won't help at all. You are not executing any action on your (at least not in your provided function) dataframe. .cache() is a good idea if you will use data several times over like:

data = sub_tax_transfer_pricing_eur_aux(...).cache() one_use_case = data.groupBy(...).agg(...).show() another_use_case = data.groupBy(...).agg(...).show()

This way you will fetch data only once (when first action is called .show() and then the next use of data dataframe should be faster. However, use this with caution - sometime fetching data again is still faster. Also, I would advise against naming the same name your dataframe over and over again. Dataframes are immutable objects, after all.

Hope this is helpful.

Nandeesh · Accepted Answer · 2019-05-08 11:57:22Z

Caching RDDs in Spark: It is one mechanism to speed up applications that access the same RDD multiple times. An RDD that is not cached, nor checkpointed, is re-evaluated again each time an action is invoked on that RDD. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference among them is that cache() will cache the RDD into memory, whereas persist(level) can cache in memory, on disk, or off-heap memory according to the caching strategy specified by level. persist() without an argument is equivalent with cache(). We discuss caching strategies later in this post. Freeing up space from the Storage memory is performed by unpersist().

When to use caching: As suggested in this post, it is recommended to use caching in the following situations:

RDD re-use in iterative machine learning applications
RDD re-use in standalone Spark applications
When RDD computation is expensive, caching can help in reducing the cost of recovery in the case one executor fails

"An RDD that is not cached, nor checkpointed, is re-evaluated again each time an action is invoked on that RDD" - the most important sentence of this entire topic.
@jd2050 Is this true for ALL actions? I refer specifically to take and show, for instance, as I think I've heard they have secret sauce behind them for this issue

Collectives™ on Stack Overflow

When to cache a DataFrame?

3 Answers 3

2 Comments

Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

2 Comments

Linked

Related