I have the following code:
dataset
    .distinct()
    .repartition(400)
    .persist(StorageLevel.MEMORY_ONLY())
    .createOrReplaceTempView("temp");

sqlContext.sql("select * from temp");

This is just an example; I need to execute around 100 queries over the same entity, which is why I'm persisting it. I expected that queries against temp would hit the cached entity, but when I check the Query Details in the Spark UI, I see that the repartition is executed again for every query over temp, so each query re-reads the dataset and executes the whole DAG.
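To give an idea of the workload, the queries each select the distinct values of one column of the view (as the logical plans below show). Roughly, the loop looks like the sketch below; the column names and the result handling are illustrative, not the exact production code:

    // Illustrative sketch only: each query asks for the distinct values of one
    // column of the persisted view; in reality about 100 such queries are run.
    // (Requires org.apache.spark.sql.Dataset and org.apache.spark.sql.Row.)
    String[] fields = {"purchase_activity", "top_brand_1", "top_shop_1"};
    for (String field : fields) {
        Dataset<Row> values = sqlContext.sql(
                "select distinct " + field + " as value from temp");
        values.collect(); // every one of these re-runs the whole DAG, including the repartition
    }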
----------------- EDIT ------------------------
Here are the query-execution diagram and the logical plans of the queries; to me they look identical. My expectation was that the first query would execute all the required steps and that subsequent queries would access the in-memory view directly.
I have checked with sqlContext.isCached("temp") and it prints true.
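The check itself is just the standard SQLContext call, something like:

    // Confirms that Spark considers the temp view cached
    boolean cached = sqlContext.isCached("temp");
    System.out.println(cached); // prints true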
Query execution Diagram
First Query Plan
== Parsed Logical Plan ==
'Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, 'purchase_activity AS value#4961]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#4958, cube_purchase_details AS cube_name#4959, purchase_activity AS field#4960, purchase_activity#4062 AS value#4961]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [purchase_activity#4062], [purchase_activity#4062]
      +- Project [purchase_activity#4062]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet

Other Queries Plan
== Parsed Logical Plan ==
'Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, 'top_brand_1 AS value#6819]
+- 'UnresolvedRelation `filter_temp`

== Analyzed Logical Plan ==
tenant_id: string, cube_name: string, field: string, value: string
Project [11 AS tenant_id#6816, cube_purchase_details AS cube_name#6817, top_brand_1 AS field#6818, top_brand_1#4072 AS value#6819]
+- SubqueryAlias filter_temp, `filter_temp`
   +- Aggregate [top_brand_1#4072], [top_brand_1#4072]
      +- Project [top_brand_1#4072]
         +- Repartition 400, true
            +- GlobalLimit 10000
               +- LocalLimit 10000
                  +- Project [purchase_activity#4062, top_shop_1#4069, top_brand_1#4072, top_brand_2#4073, top_brand_3#4074, top_brand_4#4075, top_brand_5#4076, top_manufacturer_1#4077, top_manufacturer_2#4078, top_manufacturer_3#4079, top_manufacturer_4#4080, top_manufacturer_5#4081, top_product_category_1#4082, top_product_category_2#4083, top_product_category_3#4084, top_product_category_4#4085, top_product_category_5#4086, top_salesperson_1#4093, top_salesperson_2#4094, top_salesperson_3#4095, age_category#4109, inactive#4115, activity_id#4144, activity_name#4145, ... 67 more fields]
                     +- Relation[purchase_detail_id#3918,tenant_id#3919,purchase_detail_date#3920,purchase_detail_type#3921,user_id#3922,user_domain#3923,purchase_id#3924,purchase_date#3925,is_purchase#3926,year#3927,quarter#3928,month#3929,week#3930,weekday#3931,day#3932,former_purchase_id#3933,pd_shop_id#3934,customer_id#3935,loyalty_id#3936,quantity#3937,unit_price#3938,total_price#3939,discount#3940,currency#3941,... 219 more fields] parquet

Here is a screenshot of the Spark UI Storage page, in case that could be helpful too.
How can I access this persisted dataset from spark-sql?


