Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
Spark SQL is a module of Apache Spark designed for handling structured and semi-structured data, improving upon the limitations of Apache Hive by offering better performance and fault tolerance. It features a robust architecture with support for multiple programming languages and data sources, leveraging DataFrames and a Catalyst optimizer for efficient query execution. Users can run SQL queries and process large datasets seamlessly using Spark's integrated capabilities.
Spark SQL is Apache Spark's module for structured and semi-structured data, overcoming Hive's limitations: it improves performance and allows failed jobs to be resumed.
Spark SQL features include high compatibility, integration within Spark, scalability, and support for JDBC/ODBC connectivity.
Spark SQL architecture consists of three layers supporting various data sources and programming languages, enabling structured data manipulation.
The DataFrame API facilitates working with structured/semi-structured data, inspired by R and Python, processing up to petabytes on a single cluster.
Spark SQL supports various data sources (CSV, Avro, etc.) via the DataFrame interface, lazily evaluated and integrating with Big Data tools.
Catalyst Optimizer is a key feature of Spark SQL, enhancing query optimization through a multi-phase process leveraging Scala.
SQLContext initializes Spark SQL functionalities, requiring SparkContext, while SparkSession serves as the entry point for Spark applications.
Applications can create DataFrames from existing RDDs, Hive tables, or Spark data sources, and manipulate structured data using a domain-specific language.
Spark SQL allows running SQL queries through the sql function on a SparkSession, returning results as DataFrames.
What's in it for you?
- What is Spark SQL?
- Spark SQL Features
- Spark SQL Architecture
- Spark SQL – DataFrame API
- Spark SQL – Data Source API
- Spark SQL – Catalyst Optimizer
- Running SQL Queries
- Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive.
Limitations of Hive:
- Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries.
- Hive does not allow you to resume a job if it fails in the middle of processing.
Spark performs better than Hive in most scenarios. (Source: https://engineering.fb.com/)
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:
- Integrated: You can integrate Spark SQL and query structured data inside Spark programs.
- High Compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. With existing Hive data, queries, and UDFs, Spark SQL offers full compatibility.
- Scalability: Spark SQL leverages the RDD model to support large jobs and mid-query fault tolerance, and uses the same engine for both interactive and long queries.
- Standard Connectivity: You can easily connect Spark SQL through JDBC or ODBC, both of which have become industry norms for connecting business intelligence tools.
Spark SQL Architecture
Spark SQL has three main layers:
- Language API: Spark is highly compatible, supporting languages like Python, HiveQL, Scala, and Java.
- SchemaRDD: Since Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table.
- Data Sources: Spark SQL supports multiple data sources such as JSON, Cassandra databases, and Hive tables.
Spark SQL – DataFrame API
The DataFrame API is a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema. It was designed taking inspiration from DataFrames in R and pandas in Python.
DataFrame features:
- Can process data ranging from kilobytes to petabytes on a single-node cluster.
- Can be easily integrated with all Big Data tools and frameworks via Spark Core.
- Provides APIs for Python, Java, Scala, and R.
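A minimal sketch of the DSL in action, assuming a spark-shell session (where spark is a preconfigured SparkSession) and a hypothetical in-memory Employee dataset:

// Assumes spark-shell, where `spark: SparkSession` is predefined
import spark.implicits._

// Hypothetical dataset for illustration; the schema is inferred from the case class
case class Employee(name: String, age: Int, city: String)
val df = Seq(Employee("Asha", 29, "Pune"), Employee("Ravi", 34, "Delhi")).toDF()

// Schema-aware, DSL-style operations instead of raw SQL strings
df.printSchema()
df.filter($"age" > 30).select("name", "city").show()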
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface. It supports different formats such as CSV, Hive, Avro, JSON, and Parquet. Like Spark transformations, data source reads are lazily evaluated, and they can be accessed through SQLContext and HiveContext. The Data Source API can be easily integrated with all Big Data tools and frameworks via Spark Core.
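As a brief sketch, assuming a spark-shell session and hypothetical file paths, the same DataFrameReader interface covers the different formats:

// Assumes spark-shell; the paths below are hypothetical placeholders
val jsonDF = spark.read.json("data/people.json")
val csvDF = spark.read.option("header", "true").csv("data/people.csv")
val parquetDF = spark.read.parquet("data/people.parquet")

// Reads are lazily planned; execution happens only when an action runs
jsonDF.show()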
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. It works in four phases:
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode
The accompanying diagram builds up the full pipeline: a SQL query is parsed into an unresolved logical plan; analysis against the catalog resolves it into a logical plan; logical optimization produces an optimized logical plan; physical planning generates candidate physical plans, from which a cost model selects one; and code generation turns the selected physical plan into RDDs.
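You can observe what Catalyst produces for a query with DataFrame.explain. A small sketch, assuming a spark-shell session; explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan:

// Assumes spark-shell, where `spark: SparkSession` is predefined
import spark.implicits._
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Prints the plan at each stage of the Catalyst pipeline
df.filter($"value" > 1).explain(true)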
Spark SQLContext
SQLContext is a class used for initializing the functionalities of Spark SQL. A SparkContext class object (sc) is required for initializing a SQLContext class object.
The following command initializes SparkContext through spark-shell:
$ spark-shell
The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
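The resulting SQLContext can then be used to load and query data, for example (a hypothetical path):

scala> val df = sqlcontext.read.json("data/people.json")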
SparkSession
SparkSession is the entry point to any functionality in Spark. To create a basic SparkSession, use SparkSession.builder(). (Source: https://spark.apache.org/)
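A basic example, following the pattern in the Spark documentation (the app name and config option are placeholders):

// Creates (or reuses) a SparkSession; appName and config values are placeholders
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()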
Creating DataFrames
Applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The following creates a DataFrame based on the content of a JSON file. (Source: https://spark.apache.org/)
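For instance, using the sample file shipped with the Spark distribution:

// Reads a JSON file into a DataFrame; the path points to Spark's bundled sample data
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame
df.show()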
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames. Below are some examples of structured data processing. (Source: https://spark.apache.org/)
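A few representative operations, assuming the df created above from people.json:

// $-notation for columns requires the implicits import
import spark.implicits._

// Print the schema in a tree format
df.printSchema()

// Select only the "name" column
df.select("name").show()

// Select everybody, incrementing the age by 1
df.select($"name", $"age" + 1).show()

// Select people older than 21
df.filter($"age" > 21).show()

// Count people by age
df.groupBy("age").count().show()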
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result as a DataFrame. (Source: https://spark.apache.org/)
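A short sketch, reusing the df from the previous examples:

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

// Run a query programmatically; the result comes back as a DataFrame
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()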