Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
Spark SQL is a module of Apache Spark designed for handling structured and semi-structured data, improving upon the limitations of Apache Hive by offering better performance and fault tolerance. It features a robust architecture with support for multiple programming languages and data sources, leveraging DataFrames and a Catalyst optimizer for efficient query execution. Users can run SQL queries and process large datasets seamlessly using Spark's integrated capabilities.
Spark SQL is Apache Spark's module for structured and semi-structured data, overcoming Hive's limitations: it improves performance and allows failed jobs to be resumed.
Spark SQL features include high compatibility, integration within Spark, scalability, and support for JDBC/ODBC connectivity.
Spark SQL architecture consists of three layers supporting various data sources and programming languages, enabling structured data manipulation.
The DataFrame API facilitates working with structured/semi-structured data, inspired by R and Python, processing up to petabytes on a single cluster.
Spark SQL supports various data sources (CSV, Avro, etc.) via the DataFrame interface, lazily evaluated and integrating with Big Data tools.
Catalyst Optimizer is a key feature of Spark SQL, enhancing query optimization through a multi-phase process leveraging Scala.
SQLContext initializes Spark SQL functionalities, requiring SparkContext, while SparkSession serves as the entry point for Spark applications.
Applications can create DataFrames from existing RDDs, Hive tables, or Spark data sources, and manipulate structured data using a domain-specific language.
Spark SQL allows running SQL queries through the sql function on a SparkSession, returning results as DataFrames.
What's in it for you?
- What is Spark SQL?
- Spark SQL Features
- Spark SQL Architecture
- Spark SQL – DataFrame API
- Spark SQL – Data Source API
- Spark SQL – Catalyst Optimizer
- Running SQL Queries
- Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive.
Limitations of Hive:
- Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries.
- Hive does not allow you to resume a job if it fails in the middle of processing.
Spark performs better than Hive in most scenarios. (Source: https://engineering.fb.com/)
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:
- Integrated: You can integrate Spark SQL and query structured data inside Spark programs.
- High Compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. With existing Hive data, queries, and UDFs, Spark SQL offers full compatibility.
- Scalability: Spark SQL leverages the RDD model to support large jobs and mid-query fault tolerance, and uses the same engine for both interactive and long queries.
- Standard Connectivity: You can easily connect Spark SQL through JDBC or ODBC, both of which have become industry norms for connecting business intelligence tools.
Spark SQL Architecture
Spark SQL has three main layers:
- Language API: Spark is highly compatible, supporting languages like Python, HiveQL, Scala, and Java.
- SchemaRDD: Since Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table.
- Data Sources: Spark SQL supports multiple data sources such as JSON, Cassandra databases, and Hive tables.
Spark SQL – DataFrame API
The DataFrame API is a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema. It was designed taking inspiration from DataFrames in R and pandas in Python.
DataFrame features:
- Can process data ranging from kilobytes to petabytes on a single-node cluster.
- Can be easily integrated with all Big Data tools and frameworks via Spark Core.
- Provides APIs for Python, Java, Scala, and R.
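A minimal sketch of the DSL in action, assuming a spark-shell session (where spark is a preconfigured SparkSession) and a hypothetical in-memory Employee dataset:

// Assumes spark-shell, where `spark: SparkSession` is predefined
import spark.implicits._

// Hypothetical dataset for illustration; the schema is inferred from the case class
case class Employee(name: String, age: Int, city: String)
val df = Seq(Employee("Asha", 29, "Pune"), Employee("Ravi", 34, "Delhi")).toDF()

// Schema-aware, DSL-style operations instead of raw SQL strings
df.printSchema()
df.filter($"age" > 30).select("name", "city").show()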
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface. It supports different formats such as CSV, Hive, Avro, JSON, and Parquet. Like Spark transformations, data source reads are lazily evaluated, and they can be accessed through SQLContext and HiveContext. The Data Source API can be easily integrated with all Big Data tools and frameworks via Spark Core.
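As a brief sketch, assuming a spark-shell session and hypothetical file paths, the same DataFrameReader interface covers the different formats:

// Assumes spark-shell; the paths below are hypothetical placeholders
val jsonDF = spark.read.json("data/people.json")
val csvDF = spark.read.option("header", "true").csv("data/people.csv")
val parquetDF = spark.read.parquet("data/people.parquet")

// Reads are lazily planned; execution happens only when an action runs
jsonDF.show()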
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. It works in four phases:
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode
The accompanying diagram builds up the full pipeline: a SQL query is parsed into an unresolved logical plan; analysis against the catalog resolves it into a logical plan; logical optimization produces an optimized logical plan; physical planning generates candidate physical plans, from which a cost model selects one; and code generation turns the selected physical plan into RDDs.
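You can observe what Catalyst produces for a query with DataFrame.explain. A small sketch, assuming a spark-shell session; explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan:

// Assumes spark-shell, where `spark: SparkSession` is predefined
import spark.implicits._
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Prints the plan at each stage of the Catalyst pipeline
df.filter($"value" > 1).explain(true)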
Spark SQLContext
SQLContext is a class used for initializing the functionalities of Spark SQL. A SparkContext class object (sc) is required for initializing a SQLContext class object.
The following command initializes SparkContext through spark-shell:
$ spark-shell
The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
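The resulting SQLContext can then be used to load and query data, for example (a hypothetical path):

scala> val df = sqlcontext.read.json("data/people.json")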
SparkSession
SparkSession is the entry point to any functionality in Spark. To create a basic SparkSession, use SparkSession.builder(). (Source: https://spark.apache.org/)
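A basic example, following the pattern in the Spark documentation (the app name and config option are placeholders):

// Creates (or reuses) a SparkSession; appName and config values are placeholders
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()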
Creating DataFrames
Applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The following creates a DataFrame based on the content of a JSON file. (Source: https://spark.apache.org/)
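For instance, using the sample file shipped with the Spark distribution:

// Reads a JSON file into a DataFrame; the path points to Spark's bundled sample data
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame
df.show()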
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames. Below are some examples of structured data processing. (Source: https://spark.apache.org/)
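A few representative operations, assuming the df created above from people.json:

// $-notation for columns requires the implicits import
import spark.implicits._

// Print the schema in a tree format
df.printSchema()

// Select only the "name" column
df.select("name").show()

// Select everybody, incrementing the age by 1
df.select($"name", $"age" + 1).show()

// Select people older than 21
df.filter($"age" > 21).show()

// Count people by age
df.groupBy("age").count().show()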
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result as a DataFrame. (Source: https://spark.apache.org/)
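A short sketch, reusing the df from the previous examples:

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

// Run a query programmatically; the result comes back as a DataFrame
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()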