How to access a cached, partitioned dataset from Spark SQL and use it in multiple queries?


I have the following code:

dataset
    .distinct()
    .repartition(400)
    .persist(StorageLevel.MEMORY_ONLY())
    .createOrReplaceTempView("temp");

sqlContext.sql("select * from temp");

This is just an example; I need to execute around 100 queries over the same entity, which is why I am persisting it. I thought that querying temp would query the cached entity, but when I check the Query Details in the Spark UI I see that the repartition is executed again for every query over temp, i.e. each query reads the dataset and executes the whole DAG again.

How can I access this persisted dataset from spark-sql?
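
For completeness, here is a slightly fuller sketch of the setup (assuming Spark 2.x with the Java API, an existing SparkSession/SQLContext, and dataset being the Dataset&lt;Row&gt; loaded from parquet; the explicit count() is not in my original code, it is only there to materialize the persisted data up front, since persist() by itself is lazy):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// Keep a reference to the persisted Dataset in addition to the temp view.
Dataset<Row> temp = dataset
        .distinct()
        .repartition(400)
        .persist(StorageLevel.MEMORY_ONLY());

temp.createOrReplaceTempView("temp");

// persist() is lazy: the in-memory blocks only exist after an action has run.
// (Not in my original code; added here only to force materialization.)
temp.count();

// The ~100 follow-up queries are all issued against the temp view.
sqlContext.sql("select * from temp").show();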

----------------- EDIT ------------------------

Here are the diagram and the logical plans of the queries. To me they are the same; my expectation was that the first query would execute all the required steps and that the following queries would then access the in-memory view directly.

I have checked with sqlContext.isCached("temp") and it prints true.
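
For reference, this is roughly the check I am doing, plus the catalog-based alternatives I am aware of (sqlContext is the SQLContext from the code above and spark would be the corresponding SparkSession; the commented-out calls are alternatives I have not verified to change the behaviour):

// The check from above: returns true for the "temp" view in my case.
System.out.println(sqlContext.isCached("temp"));

// Equivalent check through the Spark 2.x catalog API.
System.out.println(spark.catalog().isCached("temp"));

// Caching the table explicitly, either through the catalog or through SQL,
// instead of calling persist() on the Dataset:
// spark.catalog().cacheTable("temp");
// sqlContext.sql("CACHE TABLE temp");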

Query execution diagram:

(screenshot)

First Query Plan

== Parsed Logical Plan ==
'Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, 'purchase_activity AS value#4961]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, purchase_activity#4062 AS value#4961]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [purchase_activity#4062], [purchase_activity#4062]
      +- Project [purchase_activity#4062]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet

Other Queries Plan

== Parsed Logical Plan ==
'Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, 'top_brand_1 AS value#6819]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, top_brand_1#4072 AS value#6819]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [top_brand_1#4072], [top_brand_1#4072]
      +- Project [top_brand_1#4072]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet
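
This is how I am printing those plans (a minimal sketch; the query string is just a placeholder for my real queries against the filter_temp view, and it assumes the same imports as the snippet above). My understanding is that if the cached data were being picked up, the optimized logical plan would contain an InMemoryRelation and the physical plan an InMemoryTableScan, instead of a scan over the parquet relation:

// Print the parsed, analyzed, optimized and physical plans for one query.
Dataset<Row> query = sqlContext.sql(
        "select purchase_activity from filter_temp group by purchase_activity");
query.explain(true);

// A reused cache would show up as InMemoryRelation/InMemoryTableScan nodes
// in the optimized/physical plans rather than the parquet Relation above.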

Here is a screenshot of the Spark UI Storage page, in case that could be helpful too:

(screenshot)

How can I access this persisted dataset from spark-sql?
