Spark SQL: Relational Data Processing in Spark
Challenges and Solutions

Challenges
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.

Solutions
• A DataFrame API that can perform relational operations on both external data sources and Spark's built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
About Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014).
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments.

Example query: SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
Improvement upon Existing Art
Limitations of the prior approach:
• The engine does not understand the structure of the data in RDDs or the semantics of user functions → limited optimization.
• Can only be used to query external data in the Hive catalog → limited data sources.
• Can only be invoked via SQL strings from Spark → error prone.
• The Hive optimizer is tailored for MapReduce → difficult to extend.
Programming Interface
DataFrame
• A distributed collection of rows with the same schema (whereas RDDs suffer from type erasure).
• Supports relational operators (e.g. where, groupBy) as well as Spark operations (see the sketch below).
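A minimal Scala sketch (not from the slides; the class and column names are assumed) of building a DataFrame from an RDD of case-class objects and mixing relational operators with Spark operations, using the Spark 1.3-era SQLContext API:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
import sqlContext.implicits._

// Rows carry a known schema (name: String, age: Int), unlike a plain RDD[Person]
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
people.where(people("age") > 21).groupBy("name").count().show()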
Data Model
• Nested data model.
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.
• First-class support for complex data types (an example schema is sketched below).
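A small illustrative schema (field names are hypothetical) showing how nested and complex types compose in Spark SQL's type system:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("tags", ArrayType(StringType)),              // array of atoms
  StructField("address", StructType(Seq(                   // nested struct
    StructField("city", StringType),
    StructField("zip", IntegerType)
  ))),
  StructField("metrics", MapType(StringType, DoubleType))  // map type
))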
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL.
• Operators take expression objects.
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst.
• Alternatively, register the DataFrame as a temporary SQL table and query it with traditional SQL strings (both styles are sketched below).
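A hedged sketch (table and column names assumed) showing the two routes side by side: the DSL, which lazily builds a plan for Catalyst, and the SQL-string route over a registered temporary table.

import org.apache.spark.sql.functions._

val people = sqlContext.table("people")

// DSL: operators take expression (Column) objects and build an AST lazily
val byName = people.where(people("age") >= 18)
                   .groupBy("name")
                   .agg(avg(people("age")))

// SQL string over a temporary table, compiled into the same kind of plan
people.registerTempTable("people")
val byNameSql = sqlContext.sql("SELECT name, AVG(age) FROM people GROUP BY name")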
Advantages over Relational Query Languages
• Holistic optimization across functions composed in different languages.
• Control structures (e.g. if, for).
• The logical plan is analyzed eagerly → code errors caused by data schema issues are identified on the fly (illustrated below).
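An illustrative sketch of eager analysis (column name assumed): a misspelled column fails as soon as the expression is constructed, rather than when the job eventually runs.

val df = sqlContext.table("people")
// Because the logical plan is analyzed eagerly, this line throws an analysis
// error (a missing-column AnalysisException-style failure) immediately.
val bad = df.select("agee")   // typo for "age"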
Catalyst
Catalyst represents expressions and query plans as trees. The slide's figure shows the tree Add(Attribute(x), Literal(3)): rules rewrite the expression x + (1 + 2) into x + 3 by folding the constant subtree.
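A self-contained sketch of this kind of tree rewrite (the real Catalyst classes live in org.apache.spark.sql.catalyst; the types here are simplified stand-ins):

sealed trait Expr
case class Attribute(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A composable rule: fold Add of two literals into a single literal,
// recursing into children otherwise.
def constantFold(e: Expr): Expr = e match {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
  case Add(l, r)                   => Add(constantFold(l), constantFold(r))
  case other                       => other
}

// constantFold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
//   ==> Add(Attribute("x"), Literal(3)), i.e. x + (1 + 2) becomes x + 3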
Plan Optimization & Execution
Pipeline (from the slide's figure): a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation compiles it down to RDDs.
DataFrames and SQL share the same optimization/execution pipeline.
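A quick way to see these stages for a concrete query (a sketch; the `people` DataFrame is assumed to exist): DataFrame.explain(true) prints the parsed, analyzed, and optimized logical plans plus the selected physical plan.

val q = people.where(people("age") > 21).select("name")
q.explain(true)   // prints each plan in the pipeline for this query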
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the projection.
3. If so, switch the operators.

Original plan (top to bottom): Project name → Filter id = 1 → Project id,name → People.
After filter push-down: Project name → Project id,name → Filter id = 1 → People.
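A hedged, self-contained sketch (not Catalyst's actual rule or classes) of how such a rewrite can be expressed as a pattern match over plan trees:

sealed trait Plan
case class Scan(table: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, referencedColumns: Seq[String], child: Plan) extends Plan

def pushDownFilter(p: Plan): Plan = p match {
  // A Filter directly on top of a Project can be swapped with it as long as the
  // projection still outputs every column the filter's predicate refers to.
  case Filter(cond, refs, Project(cols, child)) if refs.forall(cols.contains) =>
    Project(cols, Filter(cond, refs, child))
  case other => other
}

// pushDownFilter(Filter("id = 1", Seq("id"), Project(Seq("id", "name"), Scan("People"))))
//   ==> Project(Seq("id", "name"), Filter("id = 1", Seq("id"), Scan("People")))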
Advanced Analytics Features: Schema Inference for Semistructured Data (JSON)
• Automatically infers a schema from a set of records, in one pass or from a sample (usage sketched below).
• The schema is a tree of STRUCT types, each of which may contain atoms, arrays, or other STRUCTs.
• Finds the most appropriate type for a field based on all data observed in that column; array element types are determined in the same way.
• Merges the schemata of individual records in one reduce operation.
• The same trick is used for inferring types of Python objects.
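A brief sketch (file path, table name, and fields are assumed) of schema inference when loading JSON; printSchema shows the inferred tree of structs, arrays, and atoms:

val tweets = sqlContext.read.json("hdfs:///data/tweets.json")   // one pass infers the schema
tweets.printSchema()
tweets.registerTempTable("tweets")
sqlContext.sql("SELECT text, user.name FROM tweets").show()     // nested fields are queryable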
The not-so-secret truth... Spark SQL is about more than SQL.
Spark SQL: Declarative Big Data Processing
Lets developers create and run Spark programs faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame, noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();
protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

private IntWritable one = new IntWritable(1);
private DoubleWritable average = new DoubleWritable();
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average

Using RDDs:
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

Using SQL:
SELECT name, avg(age) FROM people GROUP BY name

Using Pig:
P = load '/people' as (name, age);
G = group P by name;
R = foreach G generate group, AVG(P.age);
Seamlessly Integrated: RDDs
Internally, DataFrame execution is done with Spark RDDs, making interoperation with outside sources and custom algorithms easy (a data-source sketch follows below).

External input (Data Source API):
def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row]

Custom processing:
queryResult.rdd.mapPartitions { iter => /* ... your code here ... */ }
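A hedged sketch of a custom data source against the Spark 1.3-era sources API (the relation, its columns, and the in-memory "backend" are hypothetical); Spark passes in the columns it needs and the filters it could push down, so the source can prune before data ever reaches the engine:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class PeopleRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val backend = Seq(Map("id" -> 1, "name" -> "Ann"), Map("id" -> 2, "name" -> "Bob"))
    // Column pruning: emit only the requested columns, in the requested order.
    // Pushed filters are ignored here; Spark re-applies them after the scan.
    val rows = backend.map(rec => Row.fromSeq(requiredColumns.map(rec(_)).toSeq))
    sqlContext.sparkContext.parallelize(rows)
  }
}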
Extensible Input & Output
Spark's Data Source API allows optimizations like column pruning and filter push-down into custom data sources.
Sources include built-in ones such as JSON and JDBC, external/community packages, and more (loading is sketched below).
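An illustrative sketch of loading through the Data Source API with the DataFrameReader interface (Spark 1.4-era; paths and connection details are placeholders):

val events = sqlContext.read.format("json").load("hdfs:///logs/events.json")

val users = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/prod")
  .option("dbtable", "users")
  .load()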
Seamlessly Integrated
Embedding in a full programming language makes UDFs trivial and allows composition using functions.

zipToCity = udf(lambda zip_code: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")                      # load the users table
    return (events
        .join(u, events.user_id == u.user_id)      # join on user_id
        .withColumn("city", zipToCity(u.zip)))     # UDF adds a city column

add_demographics takes and returns a DataFrame.


Editor's Notes

  • #3 (About Spark SQL): Some of the limitations of Spark RDDs: they have no built-in optimization engine and no provision for handling structured data.
  • #9 (Advantages over Relational Query Languages): Traditional SQL is convenient for computing multiple things at the same time.
  • #11 (Plan Optimization & Execution): Catalyst is written in Scala because functional programming languages naturally support writing compiler-style transformation functions.