Big Data Processing With Spark and Scala http://www.edureka.co/apache-spark-scala-training
Slide 2: Objectives of this Session
  • What is Big Data?
  • What is Spark?
  • Why Spark?
  • Spark Ecosystem
  • A note about Scala
  • Why Scala?
  • MapReduce vs Spark
  • Hello Spark!
Slide 3: Big Data
  • Lots of data (terabytes or petabytes)
  • Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
  • The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
Slide 4: What is Spark?
  • Apache Spark is a general-purpose, in-memory cluster computing system
  • Provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs
  • Provides various high-level tools, such as Spark SQL for structured data processing, MLlib for machine learning, and more
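To give a feel for the high-level API described on this slide, here is a minimal word-count sketch in Scala. It is an illustration rather than the session's demo: the object name `HelloSpark`, the helper `wordCounts`, and the sample sentences are hypothetical, and it assumes Spark 1.3 or later is on the classpath and that a local master (`local[*]`) is acceptable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  // Counts how often each word occurs, using the RDD API.
  // `wordCounts` is a hypothetical helper written for this sketch.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)            // distribute the input as an RDD
      .flatMap(_.split("\\s+"))      // split each line into words
      .filter(_.nonEmpty)            // drop empty tokens
      .map(word => (word, 1))        // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum the counts per word
      .collect()                     // bring the results back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: run everything in-process on the local machine.
    val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      wordCounts(sc, Seq("hello spark", "hello scala")).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The chain of `flatMap`, `map` and `reduceByKey` only describes the computation; nothing runs until `collect()` is called, at which point the engine executes the whole graph.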
Slide 5: Why Spark?
  • The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark’s own cluster manager.
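In application code, the choice of cluster manager usually comes down to the master URL. The sketch below shows one way this might look. The object `DeploymentSketch`, the helper `confFor`, and the host names are hypothetical. The URL schemes are the ones Spark documents, but the plain `"yarn"` value assumes Spark 2.0 or later, and Mesos support depends on the Spark version in use.

```scala
import org.apache.spark.SparkConf

object DeploymentSketch {
  // Maps a cluster-manager name to a master URL.
  // Host names and ports are placeholders, not real endpoints.
  def confFor(clusterManager: String): SparkConf = {
    val master = clusterManager match {
      case "standalone" => "spark://master-host:7077" // Spark's own cluster manager
      case "mesos"      => "mesos://mesos-host:5050"  // Apache Mesos
      case "yarn"       => "yarn"                     // Hadoop YARN (Spark 2.0+ syntax)
      case _            => "local[*]"                 // fall back to local mode
    }
    new SparkConf().setAppName("DeploymentSketch").setMaster(master)
  }

  def main(args: Array[String]): Unit = {
    // Building a SparkConf does not connect to any cluster.
    Seq("standalone", "mesos", "yarn", "local").foreach { name =>
      println(s"$name -> ${confFor(name).get("spark.master")}")
    }
  }
}
```

In practice the master is often supplied at submit time rather than hard-coded, so the same application jar can run under any of the three managers.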
Slide 6: Why Spark? Polyglot
  • The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java and Python are supported).
Slide 7: Why Spark?
  • Offers a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
  • 100x faster for certain applications.
Slide 8: Why Spark?
  • Provides powerful caching and disk persistence capabilities
  • Interactive data analysis
  • Faster batch processing
  • Iterative algorithms
  • Real-time stream processing
  • Faster decision-making
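The caching and disk persistence mentioned on this slide can be sketched as follows. This is an illustrative example under stated assumptions, not the course's own code: `CachingSketch` and `sumAndMax` are hypothetical names, the input is a toy sequence, and Spark is assumed to be on the classpath with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Runs two actions over the same RDD, persisting it in between so that
  // the second action can reuse the already computed partitions.
  def sumAndMax(sc: SparkContext, numbers: Seq[Int]): (Long, Int) = {
    val squares = sc.parallelize(numbers)
      .map(n => n * n)
      // Keep partitions in memory and spill to disk if they do not fit.
      .persist(StorageLevel.MEMORY_AND_DISK)

    val total = squares.map(_.toLong).reduce(_ + _) // first action: computes and stores
    val largest = squares.max()                     // second action: reads stored data

    squares.unpersist()                             // release the storage when done
    (total, largest)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CachingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumAndMax(sc, Seq(1, 2, 3)))
    } finally {
      sc.stop()
    }
  }
}
```

Reusing a persisted data set across several actions is what makes iterative algorithms and interactive analysis cheaper than recomputing from the source each time.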
Slide 9: Spark Community is Super Active!
Slide 10: Spark Ecosystem
  • Spark Core Engine
  • Shark (SQL)
  • Spark Streaming (Streaming)
  • MLlib (Machine learning)
  • GraphX (Graph Computation)
  • SparkR (R on Spark)
  • BlinkDB (Approximate SQL)
  (Diagram legend: Alpha/Pre-alpha)
Slide 11: Spark Ecosystem (Contd.)
  • Spark Core Engine
  • Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
  • Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.
  • MLlib (Machine learning): Machine learning library being built on top of Spark. Supports many machine learning algorithms, with speeds up to 100 times faster than MapReduce.
  • GraphX (Graph Computation): Graph computation engine (similar to Giraph).
  • SparkR (R on Spark): Package for the R language that lets R users leverage Spark from the R shell.
  • BlinkDB (Approximate SQL): An approximate query engine that runs over the Spark Core Engine.
  (Diagram legend: Alpha/Pre-alpha)
Slide 12: A Note on Scala
  • Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way
  • Scala supports both object-oriented programming and functional programming
  • Scala is part of the fabric of present and future Big Data frameworks such as Scalding, Spark and Akka
    » All Spark examples in class will be written in Scala
    » Scala will be covered before Spark as part of the course
Slide 13: Why Scala?
  • Scala is a pure object-oriented language. Conceptually, every value is an object and every operation is a method call. The language supports advanced component architectures through classes and traits
  • Scala is also a functional language. It supports first-class functions and immutable data structures, and prefers immutability over mutation
  • Seamlessly integrated with Java
  • Used heavily in Big Data and development frameworks such as Spark, Akka, Scalding and Play
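The points on this slide can be seen together in a few lines of plain Scala. The sketch below is only an illustration: the names `Describable`, `Framework` and `WhyScalaSketch`, and the sample data, are hypothetical and do not come from the course material.

```scala
// A trait is a reusable unit of behaviour that classes can mix in.
trait Describable {
  def name: String
  def describe: String = s"Framework: $name"
}

// A case class is immutable by default: its fields cannot be reassigned.
final case class Framework(name: String, language: String) extends Describable

object WhyScalaSketch {
  // Sample data, made up for this illustration.
  val frameworks: List[Framework] = List(
    Framework("Spark", "Scala"),
    Framework("Akka", "Scala"),
    Framework("Hadoop MapReduce", "Java")
  )

  // Functions are values: `isScala` can be passed around like any other argument.
  val isScala: Framework => Boolean = _.language == "Scala"

  // filter and map return new lists; the input list is never mutated.
  def scalaFrameworkNames(all: List[Framework]): List[String] =
    all.filter(isScala).map(_.name)

  def main(args: Array[String]): Unit = {
    // Every operation is a method call: 1 + 2 is sugar for (1).+(2).
    println((1).+(2))                                        // prints 3
    println(scalaFrameworkNames(frameworks).mkString(", "))  // prints Spark, Akka
    // copy builds a modified instance instead of changing the original.
    println(frameworks.head.copy(name = "Spark Streaming").describe)
  }
}
```

Because `Framework` compiles to an ordinary JVM class, it can also be used from Java code, which is one practical side of the Java integration mentioned above.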
Slide 14: Real Time Analytics
  • If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly
  • Hadoop works on batch processing, hence response time is high
  (Diagram: input data accumulates from Day 1 to Day n and is then processed using MR, which introduces a time lag.)
Slide 15: Real Time Analytics – Accepted Way
  (Diagram labels: Streaming Data, Storing)
Slide 16: MapReduce vs Spark
  (Chart values: 14 sec vs 0.6 sec)
Slide 17: Spark Demo!
Slide 18: Questions?