3

Given partitioned by some_field (of int type) Hive table with data stored as Avro files, I want to query table using Spark SQL in a way that returned Data Frame have to be already partitioned by some_field (used for partitioning).

Query looks like just

SELECT * FROM some_table 

By default Spark doesn't do that, returned data_frame.rdd.partitioner is None.

One way to get result is via explicit repartitioning after querying, but probably there is better solution.

HDP 2.6, Spark 2.

Thanks.

2
  • 1
    I think there are 2 separate things that you are talking about, hive partition and dataset partitioning and both are completely independent. Follow line to read about rdd/dataset partitioning. Commented Nov 8, 2017 at 16:45
  • Of course, they are independent, but until execution engine cannot utilize underlying storage partitioning, latter is useless. Thanks for link. Commented Nov 8, 2017 at 17:23

1 Answer 1

4

First of all you have to distinguish between partitioning of a Dataset and partitioning of the converted RDD[Row]. No matter what is the execution plan of the former one, the latter one won't have a Partitioner:

scala> val df = spark.range(100).repartition(10, $"id") df: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> df.rdd.partitioner res1: Option[org.apache.spark.Partitioner] = None 

However internal RDD, might have a Partitioner:

scala> df.queryExecution.toRdd.partitioner res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3) 

This however is unlikely to help you here, because as of today (Spark 2.2), Data Source API is not aware of the physical storage information (with exception of simple partition pruning). This should change in the upcoming Data Source API. Please refer to the JIRA ticket (SPARK-15689) and design document for details.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.