How to access a cached, partitioned dataset from Spark SQL and use it in multiple queries?


I have the following code:

dataset
    .distinct()
    .repartition(400)
    .persist(StorageLevel.MEMORY_ONLY())
    .createOrReplaceTempView("temp");

sqlContext.sql("select * from temp");

This is just an example; I need to execute around 100 queries over the same entity, which is why I am persisting it. I thought that querying temp would query the cached entity, but when I check the Query Details in the Spark UI I see that the repartition is executed again for every query over temp, i.e. each query reads the dataset and executes the whole DAG again.

How can I access this persisted dataset from spark-sql?
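
For completeness, here is a slightly fuller sketch of the setup (assuming Spark 2.x with the Java API, an existing SparkSession/SQLContext, and dataset being the Dataset&lt;Row&gt; loaded from parquet; the explicit count() is not in my original code, it is only there to materialize the persisted data up front, since persist() by itself is lazy):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// Keep a reference to the persisted Dataset in addition to the temp view.
Dataset<Row> temp = dataset
        .distinct()
        .repartition(400)
        .persist(StorageLevel.MEMORY_ONLY());

temp.createOrReplaceTempView("temp");

// persist() is lazy: the in-memory blocks only exist after an action has run.
// (Not in my original code; added here only to force materialization.)
temp.count();

// The ~100 follow-up queries are all issued against the temp view.
sqlContext.sql("select * from temp").show();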

----------------- EDIT ------------------------

Here are the diagram and the logical plans of the queries. To me they are the same; my expectation was that the first query would execute all the required steps and that the following queries would then access the in-memory view directly.

I have checked with sqlContext.isCached("temp") and it prints true.
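
For reference, this is roughly the check I am doing, plus the catalog-based alternatives I am aware of (sqlContext is the SQLContext from the code above and spark would be the corresponding SparkSession; the commented-out calls are alternatives I have not verified to change the behaviour):

// The check from above: returns true for the "temp" view in my case.
System.out.println(sqlContext.isCached("temp"));

// Equivalent check through the Spark 2.x catalog API.
System.out.println(spark.catalog().isCached("temp"));

// Caching the table explicitly, either through the catalog or through SQL,
// instead of calling persist() on the Dataset:
// spark.catalog().cacheTable("temp");
// sqlContext.sql("CACHE TABLE temp");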

Query execution diagram:

(screenshot)

First Query Plan

== Parsed Logical Plan ==
'Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, 'purchase_activity AS value#4961]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, purchase_activity#4062 AS value#4961]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [purchase_activity#4062], [purchase_activity#4062]
      +- Project [purchase_activity#4062]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet

Other Queries Plan

== Parsed Logical Plan ==
'Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, 'top_brand_1 AS value#6819]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, top_brand_1#4072 AS value#6819]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [top_brand_1#4072], [top_brand_1#4072]
      +- Project [top_brand_1#4072]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet
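
This is how I am printing those plans (a minimal sketch; the query string is just a placeholder for my real queries against the filter_temp view, and it assumes the same imports as the snippet above). My understanding is that if the cached data were being picked up, the optimized logical plan would contain an InMemoryRelation and the physical plan an InMemoryTableScan, instead of a scan over the parquet relation:

// Print the parsed, analyzed, optimized and physical plans for one query.
Dataset<Row> query = sqlContext.sql(
        "select purchase_activity from filter_temp group by purchase_activity");
query.explain(true);

// A reused cache would show up as InMemoryRelation/InMemoryTableScan nodes
// in the optimized/physical plans rather than the parquet Relation above.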

Here is a screenshot of the Spark UI Storage page, in case that could be helpful too:

(screenshot)

How can I access this persisted dataset from spark-sql?
