
I execute Spark SQL queries reading from Hive tables, and execution is lengthy (15 min). I am interested in optimizing the query execution, so I am asking: does the execution of those queries use Hive's execution engine (making it similar to executing the queries in the Hive editor), or does Spark use the Hive Metastore only to find the locations of the files and then deal with the files directly?

import os
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("src_count") \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.memory', '29g') \
    .config('spark.driver.memory', '16g') \
    .config('spark.driver.maxResultSize', '12g') \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .getOrCreate()

sql = "SELECT S.SERVICE, \
       COUNT(DISTINCT CONTRACT_KEY) DISTINCT_CNT, \
       COUNT(*) CNT ... "
df = spark.sql(sql)
df.toPandas()
  • Why do you want to convert it into a pandas DataFrame? Is there any specific need for that? Commented Jul 29, 2019 at 10:39
  • The Spark SQL engine uses Hive in general even if you don't directly work with Hive. Commented Jul 29, 2019 at 11:52
  • @vikrantrana it is an aggregation query and returns a limited number of records, fewer than 20. Commented Jul 30, 2019 at 8:33
  • @IlyaBrodezki does it use it as a metastore only and execute the query as RDDs or DataFrames, for example, or does it use the Hive server for execution as if I were executing it in the Hive editor? Commented Jul 30, 2019 at 8:36
  • You can use Spark's built-in functions to improve performance; choose pandas or Python functions only if something cannot be done with Spark's built-in functions. Commented Jul 30, 2019 at 9:58

1 Answer


You can read the Hive table as follows:

  1. Read the entire Hive table:

df = spark.table("<HIVE_DB>.<HIVE_TBL>")

  2. Read part of the table based on a SQL query:

df = spark.sql(<YOUR_SQL_Query>)

Also, in your question you are converting the Spark DataFrame to a pandas DataFrame, which is not recommended for large results. In that case all of the data is sent from the workers to the driver, which transfers a lot of data across the network and slows the application down; the driver is also overloaded, since it has to hold the entire dataset, and it may run out of memory (OOM) as well.
