Spark SQL: Relational Data Processing in Spark
Challenges and Solutions

Challenges
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.

Solutions
• A DataFrame API that can perform relational operations on both external data sources and Spark's built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
About Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014).
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments.

Example query: SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
Improvement upon Existing Art
Limitations of the prior approach:
• The engine does not understand the structure of the data in RDDs or the semantics of user functions → limited optimization.
• Can only be used to query external data in the Hive catalog → limited data sources.
• Can only be invoked via SQL strings from Spark → error prone.
• The Hive optimizer is tailored for MapReduce → difficult to extend.
Programming Interface
DataFrame
• A distributed collection of rows with the same schema (whereas RDDs suffer from type erasure).
• Supports relational operators (e.g. where, groupBy) as well as Spark operations (see the sketch below).
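A minimal Scala sketch (not from the slides; the class and column names are assumed) of building a DataFrame from an RDD of case-class objects and mixing relational operators with Spark operations, using the Spark 1.3-era SQLContext API:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
import sqlContext.implicits._

// Rows carry a known schema (name: String, age: Int), unlike a plain RDD[Person]
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
people.where(people("age") > 21).groupBy("name").count().show()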
Data Model
• Nested data model.
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.
• First-class support for complex data types (an example schema is sketched below).
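A small illustrative schema (field names are hypothetical) showing how nested and complex types compose in Spark SQL's type system:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("tags", ArrayType(StringType)),              // array of atoms
  StructField("address", StructType(Seq(                   // nested struct
    StructField("city", StringType),
    StructField("zip", IntegerType)
  ))),
  StructField("metrics", MapType(StringType, DoubleType))  // map type
))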
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL.
• Operators take expression objects.
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst.
• Alternatively, register the DataFrame as a temporary SQL table and query it with traditional SQL strings (both styles are sketched below).
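A hedged sketch (table and column names assumed) showing the two routes side by side: the DSL, which lazily builds a plan for Catalyst, and the SQL-string route over a registered temporary table.

import org.apache.spark.sql.functions._

val people = sqlContext.table("people")

// DSL: operators take expression (Column) objects and build an AST lazily
val byName = people.where(people("age") >= 18)
                   .groupBy("name")
                   .agg(avg(people("age")))

// SQL string over a temporary table, compiled into the same kind of plan
people.registerTempTable("people")
val byNameSql = sqlContext.sql("SELECT name, AVG(age) FROM people GROUP BY name")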
Advantages over Relational Query Languages
• Holistic optimization across functions composed in different languages.
• Control structures (e.g. if, for).
• The logical plan is analyzed eagerly → code errors caused by data schema issues are identified on the fly (illustrated below).
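An illustrative sketch of eager analysis (column name assumed): a misspelled column fails as soon as the expression is constructed, rather than when the job eventually runs.

val df = sqlContext.table("people")
// Because the logical plan is analyzed eagerly, this line throws an analysis
// error (a missing-column AnalysisException-style failure) immediately.
val bad = df.select("agee")   // typo for "age"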
Catalyst
Catalyst represents expressions and query plans as trees. The slide's figure shows the tree Add(Attribute(x), Literal(3)): rules rewrite the expression x + (1 + 2) into x + 3 by folding the constant subtree.
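A self-contained sketch of this kind of tree rewrite (the real Catalyst classes live in org.apache.spark.sql.catalyst; the types here are simplified stand-ins):

sealed trait Expr
case class Attribute(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A composable rule: fold Add of two literals into a single literal,
// recursing into children otherwise.
def constantFold(e: Expr): Expr = e match {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
  case Add(l, r)                   => Add(constantFold(l), constantFold(r))
  case other                       => other
}

// constantFold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
//   ==> Add(Attribute("x"), Literal(3)), i.e. x + (1 + 2) becomes x + 3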
Plan Optimization & Execution
Pipeline (from the slide's figure): a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation compiles it down to RDDs.
DataFrames and SQL share the same optimization/execution pipeline.
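A quick way to see these stages for a concrete query (a sketch; the `people` DataFrame is assumed to exist): DataFrame.explain(true) prints the parsed, analyzed, and optimized logical plans plus the selected physical plan.

val q = people.where(people("age") > 21).select("name")
q.explain(true)   // prints each plan in the pipeline for this query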
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the projection.
3. If so, switch the operators.

Original plan (top to bottom): Project name → Filter id = 1 → Project id,name → People.
After filter push-down: Project name → Project id,name → Filter id = 1 → People.
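A hedged, self-contained sketch (not Catalyst's actual rule or classes) of how such a rewrite can be expressed as a pattern match over plan trees:

sealed trait Plan
case class Scan(table: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, referencedColumns: Seq[String], child: Plan) extends Plan

def pushDownFilter(p: Plan): Plan = p match {
  // A Filter directly on top of a Project can be swapped with it as long as the
  // projection still outputs every column the filter's predicate refers to.
  case Filter(cond, refs, Project(cols, child)) if refs.forall(cols.contains) =>
    Project(cols, Filter(cond, refs, child))
  case other => other
}

// pushDownFilter(Filter("id = 1", Seq("id"), Project(Seq("id", "name"), Scan("People"))))
//   ==> Project(Seq("id", "name"), Filter("id = 1", Seq("id"), Scan("People")))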
Advanced Analytics Features: Schema Inference for Semistructured Data (JSON)
• Automatically infers a schema from a set of records, in one pass or from a sample (usage sketched below).
• The schema is a tree of STRUCT types, each of which may contain atoms, arrays, or other STRUCTs.
• Finds the most appropriate type for a field based on all data observed in that column; array element types are determined in the same way.
• Merges the schemata of individual records in one reduce operation.
• The same trick is used for inferring types of Python objects.
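A brief sketch (file path, table name, and fields are assumed) of schema inference when loading JSON; printSchema shows the inferred tree of structs, arrays, and atoms:

val tweets = sqlContext.read.json("hdfs:///data/tweets.json")   // one pass infers the schema
tweets.printSchema()
tweets.registerTempTable("tweets")
sqlContext.sql("SELECT text, user.name FROM tweets").show()     // nested fields are queryable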
The not-so-secret truth... Spark SQL is about more than SQL.
Spark SQL: Declarative Big Data Processing
Lets developers create and run Spark programs faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame, noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();
protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

private IntWritable one = new IntWritable(1);
private DoubleWritable average = new DoubleWritable();
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average

Using RDDs:
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

Using SQL:
SELECT name, avg(age) FROM people GROUP BY name

Using Pig:
P = load '/people' as (name, age);
G = group P by name;
R = foreach G generate group, AVG(P.age);
Seamlessly Integrated: RDDs
Internally, DataFrame execution is done with Spark RDDs, making interoperation with outside sources and custom algorithms easy (a data-source sketch follows below).

External input (Data Source API):
def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row]

Custom processing:
queryResult.rdd.mapPartitions { iter => /* ... your code here ... */ }
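A hedged sketch of a custom data source against the Spark 1.3-era sources API (the relation, its columns, and the in-memory "backend" are hypothetical); Spark passes in the columns it needs and the filters it could push down, so the source can prune before data ever reaches the engine:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class PeopleRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val backend = Seq(Map("id" -> 1, "name" -> "Ann"), Map("id" -> 2, "name" -> "Bob"))
    // Column pruning: emit only the requested columns, in the requested order.
    // Pushed filters are ignored here; Spark re-applies them after the scan.
    val rows = backend.map(rec => Row.fromSeq(requiredColumns.map(rec(_)).toSeq))
    sqlContext.sparkContext.parallelize(rows)
  }
}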
Extensible Input & Output
Spark's Data Source API allows optimizations like column pruning and filter push-down into custom data sources.
Sources include built-in ones such as JSON and JDBC, external/community packages, and more (loading is sketched below).
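An illustrative sketch of loading through the Data Source API with the DataFrameReader interface (Spark 1.4-era; paths and connection details are placeholders):

val events = sqlContext.read.format("json").load("hdfs:///logs/events.json")

val users = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/prod")
  .option("dbtable", "users")
  .load()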
Seamlessly Integrated
Embedding in a full programming language makes UDFs trivial and allows composition using functions.

zipToCity = udf(lambda zip_code: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")                      # load the users table
    return (events
        .join(u, events.user_id == u.user_id)      # join on user_id
        .withColumn("city", zipToCity(u.zip)))     # UDF adds a city column

add_demographics takes and returns a DataFrame.


Editor's Notes

  • #3 (About Spark SQL): Some of the limitations of Spark RDDs: they have no built-in optimization engine and no provision for handling structured data.
  • #9 (Advantages over Relational Query Languages): Traditional SQL is convenient for computing multiple things at the same time.
  • #11 (Plan Optimization & Execution): Catalyst is written in Scala because functional programming languages naturally support writing compiler-style transformation functions.