Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
Guido Schmutz • Working for Trivadis for more than 18 years • Oracle ACE Director for Fusion Middleware and SOA • Author of different books • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data • Technology Manager @ Trivadis • More than 25 years of software development experience • Contact: guido.schmutz@trivadis.com • Blog: http://guidoschmutz.wordpress.com • Twitter: gschmutz
Agenda 1. Introduction 2. Apache Spark 3. Apache Cassandra 4. Combining Spark & Cassandra 5. Summary
Big Data Definition (4 Vs) • Characteristics of Big Data: its Volume, Velocity and Variety in combination • + Time to action? Big Data + Real-Time = Stream Processing
What is Real-Time Analytics? What is it? Why do we need it? How does it work? • Collect real-time data • Process data as it flows in • Data in Motion over Data at Rest • Reports and dashboards access the processed data • Short time to analyze & respond: required for new business models, desired for competitive advantage
Real Time Analytics Use Cases • Algorithmic Trading • Online Fraud Detection • Geo Fencing • Proximity/Location Tracking • Intrusion detection systems • Traffic Management • Recommendations • Churn detection • Internet of Things (IoT) / Intelligence Sensors • Social Media/Data Analytics • Gaming Data Feed • …
Apache Spark
Motivation – Why Apache Spark? • Hadoop MapReduce shares data between jobs on disk: Input → HDFS read → map → reduce → HDFS write → HDFS read → … → Output • Spark speeds up processing by keeping intermediate data in memory instead of on disk: Input → op1 → op2 → … → Output
Apache Spark Apache Spark is a fast and general engine for large-scale data processing • The hot trend in Big Data! • Originally developed 2009 in UC Berkeley's AMPLab • Based on the 2007 Microsoft Dryad paper • Written in Scala, supports Java, Python, SQL and R • Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • One of the largest OSS communities in big data with over 200 contributors in 50+ organizations • Open sourced in 2010 – since 2014 part of the Apache Software Foundation
Apache Spark • Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing) • Core Runtime: Spark Core API and Execution Model • Cluster Resource Managers: Spark Standalone, MESOS, YARN • Data Stores: HDFS, Elasticsearch, NoSQL, S3
Resilient Distributed Dataset (RDD) Are • Immutable • Re-computable • Fault tolerant • Reusable Have Transformations • Produce a new RDD • Rich set of transformations available • filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ... Have Actions • Start cluster computing operations • Rich set of actions available • collect(), count(), fold(), reduce(), …
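The lazy-transformation / eager-action split above can be sketched in a few lines of plain Python (a toy `MiniRDD`, not the Spark API): transformations only record work on an immutable, re-computable input, and nothing runs until an action is called.

```python
# Toy sketch of RDD semantics -- transformations are lazy, actions execute.
class MiniRDD:
    def __init__(self, data, ops=None):
        self.data = data          # the re-computable input
        self.ops = ops or []      # recorded transformations, not yet run

    def map(self, f):             # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):          # transformation
        return MiniRDD(self.data, self.ops + [("filter", p)])

    def _compute(self):           # runs only when an action is called
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def collect(self):            # action: materializes the result
        return self._compute()

    def count(self):              # action
        return len(self._compute())

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

Because only the input and the operation list are kept, a lost partition of the result can always be recomputed from them, which is the "re-computable, fault tolerant" property above.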
RDD • An RDD is created from an input source: file, database, stream or collection • An action such as .count() triggers computation and returns a value (e.g. 100)
Partitions • An RDD's data is split into partitions (Partition 0 to Partition 9) that are distributed across the servers of the cluster (Server 1 to Server 5)
Partitions • If a server is lost (e.g. Server 1), its partitions are re-computed and redistributed across the remaining servers (Server 2 to Server 5)
Spark Workflow • sc.hadoopFile() reads an HDFS input file into a HadoopRDD • Stage 1 – flatMap() + map() produce MappedRDDs (partitions P0, P1, P3) • Stage 2 – reduceByKey() produces a ShuffledRDD (P0) • Transformations are lazy; the action saveAsTextFile() executes them and writes the text file output • The Master's DAG Scheduler builds the stages
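The word-count workflow above, reduced to plain Python (the HDFS read is replaced by an in-memory list; this only illustrates the per-stage logic, not Spark's distributed execution):

```python
from collections import defaultdict

lines = ["to be or not to be", "to do or not to do"]  # stands in for the HDFS input

# Stage 1 - flatMap() + map(): split lines into words, emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Stage 2 - reduceByKey(): sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'do': 2}
```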
Spark Workflow • Two HDFS input files are read with SparkContext.hadoopFile() into HadoopRDDs • Input 1 passes through filter() and map() (FilteredRDD, MappedRDD); Input 2 passes through map() (MappedRDD) • join() combines them into a ShuffledRDD • Transformations are lazy; the action SparkContext.saveAsHadoopFile() executes them and writes the HDFS output file
Spark Execution Model • A Master coordinates Workers • A Worker runs on a server, next to the data storage, and starts Executors
Spark Execution Model – Narrow Transformations • Stage 1 – flatMap() + map(): narrow transformations such as filter(), map(), sample() and flatMap() operate on each RDD partition (P0, P1, P3) locally on its worker, without moving data
Spark Execution Model – Wide Transformations • Stage 2 – reduceByKey(): wide transformations such as join(), reduceByKey(), union() and groupByKey() need data from other partitions and therefore cause a shuffle across workers
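Why reduceByKey() forces a shuffle can be shown with a small plain-Python sketch (hypothetical data): pairs with the same key start out in different partitions, so each partition first routes every pair by hash of its key before the reduce can run locally.

```python
# Three input partitions holding (key, count) pairs, as on different workers
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1), ("a", 1)]]
num_out = 2  # number of post-shuffle partitions

# Shuffle: every input partition sends each pair to hash(key) % num_out,
# so all pairs with the same key end up in the same output partition
shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for key, value in part:
        shuffled[hash(key) % num_out].append((key, value))

# Reduce: after the shuffle, each key can be summed locally in one partition
result = {}
for part in shuffled:
    for key, value in part:
        result[key] = result.get(key, 0) + value

print(result)  # counts per key, e.g. {'a': 3, 'b': 2, 'c': 1}
```

The routing step is exactly the cross-worker data movement ("Shuffle!") that makes wide transformations expensive compared to narrow ones.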
Batch vs. Real-Time Processing • Batch: petabytes of data • Real-time: gigabytes per second
Various Input Sources
Apache Kafka • Distributed publish-subscribe messaging system • Designed for processing of real-time activity stream data (logs, metrics collection, social media streams, …) • Initially developed at LinkedIn, now part of Apache • Does not use the JMS API and standards • Kafka maintains feeds of messages in topics • Producers write to and consumers read from a Kafka cluster
Apache Kafka • A Weather Station producer writes messages (1–6) to a Temperature Topic and a Rainfall Topic on a Kafka Broker; a Temperature Processor and a Rainfall Processor consume them
Apache Kafka • A topic can be split into partitions (here the Temperature Topic has Partition 0 and Partition 1) • Each partition is an ordered sequence of messages (1–6), and one Temperature Processor instance can consume each partition in parallel
Apache Kafka • Topic partitions (Temperature Topic P0/P1, Rainfall Topic P0) are replicated across multiple Kafka Brokers, so processing continues if a broker fails
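A minimal plain-Python sketch of the topic/partition model (a hypothetical `MiniTopic`, not the Kafka API): producers append to a partition chosen by key, and every consumer reads independently from its own offset.

```python
class MiniTopic:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, message):
        # Messages with the same key land in the same partition,
        # which preserves per-key ordering
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p

    def consume(self, partition, offset):
        # Consumers pull by (partition, offset); reading removes nothing,
        # so many independent consumers can read the same messages
        return self.partitions[partition][offset:]

temperature = MiniTopic(num_partitions=2)
p = temperature.produce("station-1", "21.5 C")
temperature.produce("station-1", "21.7 C")

print(temperature.consume(p, 0))  # ['21.5 C', '21.7 C']
print(temperature.consume(p, 1))  # ['21.7 C'] - a second, independent reader
```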
Discretized Stream (DStream) Kafka Weather Station Weather Station Weather Station
Discretized Stream (DStream) • The stream of individual events from Kafka (Weather Stations) is made discrete by time: each time slice of events becomes one RDD, and the sequence of these RDDs is the DStream
Discretized Stream (DStream) • Transformations such as .map(), .join(), .reduceByKey() and .countByValue() are applied to each X-seconds batch of a DStream and produce a new DStream
Discretized Stream (DStream) • At each batch interval (time 1, time 2, time 3, … time n) the messages of the Input Stream form one RDD of the Event DStream • A transformation such as map() applies f(message) to every message of that batch, yielding the corresponding RDD of the MappedDStream • An action such as saveAsHadoopFiles() triggers a Spark job per batch • The transformation lineage applies per batch, with time increasing from left to right (Adapted from Chris Fregly: http://slidesha.re/11PP7FV)
Apache Spark Streaming – Core concepts Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDD Input DStreams • Represent the stream of raw data received from streaming sources • Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc. • Custom sources can easily be written for custom data sources Operations • Same as Spark Core + additional stateful transformations (window(), reduceByWindow())
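The micro-batch idea can be sketched in plain Python (hypothetical event data): cut timestamped events into fixed intervals, and run the same batch-style operation, here a countByValue-style count, on every interval.

```python
from collections import Counter

BATCH_SECONDS = 10

# (timestamp, value) events as they might arrive from Kafka (made-up data)
events = [(0, "rain"), (3, "sun"), (9, "rain"), (12, "sun"), (14, "sun"), (27, "rain")]

# Discretize by time: one micro-batch ("RDD") per BATCH_SECONDS interval
batches = {}
for ts, value in events:
    batches.setdefault(ts // BATCH_SECONDS, []).append(value)

# Apply the same batch operation to every micro-batch, like .countByValue()
counts_per_batch = {t: dict(Counter(batch)) for t, batch in sorted(batches.items())}
print(counts_per_batch)
# {0: {'rain': 2, 'sun': 1}, 1: {'sun': 2}, 2: {'rain': 1}}
```

This is why DStream operations look like RDD operations: each interval really is an ordinary batch, just a small and frequent one.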
Apache Cassandra
Apache Cassandra Apache Cassandra™ is a free • Distributed… • High performance… • Extremely scalable… • Fault tolerant (i.e. no single point of failure)… post-relational database solution Optimized for high write throughput
Apache Cassandra - History • Influenced by Google Bigtable and Amazon Dynamo
Motivation - Why NoSQL Databases? • Dynamo Paper (2007) • How to build a data store that is • Reliable • Performant • “Always On” • Nothing new and shiny • 24 other papers cited • Evolutionary
Motivation - Why NoSQL Databases? • Google Bigtable (2006) • Richer data model • 1 key and lots of values • Fast sequential access • 38 other papers cited
Motivation - Why NoSQL Databases? • Cassandra Paper (2008) • Distributed features of Dynamo • Data Model and storage from BigTable • February 2010 graduated to a top-level Apache Project
Apache Cassandra – More than one server • All nodes participate in a cluster • Shared nothing • Add or remove as needed • More capacity? Add more servers • A node is the basic unit inside a cluster • Each node owns a range of partitions, assigned by consistent hashing: Node 1 [0-25], Node 2 [26-50], Node 3 [51-75], Node 4 [76-100] • Each range is also replicated to other nodes
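The consistent-hashing idea can be sketched in plain Python with the token ranges from the slide (Cassandra really uses e.g. the Murmur3 partitioner; the md5-based token and replication factor 3 here are assumptions for illustration):

```python
import hashlib

# Token ranges from the slide: each node owns one range of the 0-100 ring
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]

def token(partition_key):
    # Stable hash onto 0..100 (a stand-in for the real partitioner)
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 101

def replicas(partition_key, rf=3):
    t = token(partition_key)
    # Find the node whose range covers the token...
    start = next(i for i, (upper, _) in enumerate(RING) if t <= upper)
    # ...then take the owner plus the next rf-1 nodes walking around the ring
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(token("station-42"), replicas("station-42"))
```

Because the key alone determines the owning node, any client can route a read or write without a central lookup, and adding a node only moves the ranges adjacent to it.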
Apache Cassandra – Fully Replicated • A client writes to its local data center (e.g. West, Nodes 1–4) • Data syncs across the WAN to the other data center (East, Nodes 1–4) • Replication is configured per data center
Apache Cassandra What is Cassandra NOT? • A Data Ocean • A Data Lake • A Data Pond • An In-Memory Database • A Key-Value Store • Not for Data Warehousing What are good use cases? • Product Catalog / Playlists • Personalization (Ads, Recommendations) • Fraud Detection • Time Series (Finance, Smart Meter) • IoT / Sensor Data • Graph / Network data
How Cassandra stores data • Model brought from Google Bigtable • A row key and a lot of columns: up to 2 billion columns per row, billions of rows • Column names are sorted (UTF8, Int, Timestamp, etc.) • Each column holds a column name, a column value, a timestamp and a TTL
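A plain-Python sketch of this wide-row model (a hypothetical `MiniWideRowStore`, not Cassandra's storage engine): one row key maps to many columns kept sorted by name, each cell carrying value, timestamp and TTL, so slicing a column range within a row is cheap.

```python
import bisect
import time

class MiniWideRowStore:
    def __init__(self):
        self.rows = {}  # row key -> (sorted column names, {name: (value, ts, ttl)})

    def put(self, row_key, column, value, ttl=None):
        names, cells = self.rows.setdefault(row_key, ([], {}))
        if column not in cells:
            bisect.insort(names, column)           # keep column names sorted
        cells[column] = (value, time.time(), ttl)  # cell = value + timestamp + TTL

    def slice(self, row_key, start, end):
        # Sorted column names make range reads within one row a binary search
        names, cells = self.rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return [(n, cells[n][0]) for n in names[lo:hi]]

store = MiniWideRowStore()
for hour in ("08:00", "09:00", "10:00", "11:00"):
    store.put("station-1:2015-06-01", hour, f"{20 + int(hour[:2]) / 10} C")

print(store.slice("station-1:2015-06-01", "09:00", "10:00"))
# [('09:00', '20.9 C'), ('10:00', '21.0 C')]
```

This is also why Cassandra suits time series: with a timestamp as the column name, one row holds a whole series in order.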
Combining Spark & Cassandra
Spark and Cassandra Architecture – Great Combo • Spark: good at analyzing a huge amount of data • Cassandra: good at storing a huge amount of data
Spark and Cassandra Architecture • Spark Streaming (Near Real-Time) • SparkSQL (Structured Data) • MLlib (Machine Learning) • GraphX (Graph Analysis)
Spark and Cassandra Architecture • Weather Stations stream data into Spark Streaming (Near Real-Time); SparkSQL (Structured Data), MLlib (Machine Learning) and GraphX (Graph Analysis) run on top • The Spark Connector links the Spark stack to Cassandra
Spark and Cassandra Architecture • Single node running Cassandra • The Spark Worker is really small • The Spark Master lives outside the node • The Spark Worker starts Spark Executors in separate JVMs • Node local
Spark and Cassandra Architecture • Each node runs Spark and Cassandra • The Spark Master can make decisions based on token ranges (0-25, 26-50, 51-75, 76-100) • Spark likes to work on small partitions of data across a large cluster • Cassandra likes to spread out data in a large cluster • Each worker will only have to analyze the 25% of the data that is local to its node!
Spark and Cassandra Architecture • Run two rings with the same token ranges (0-25, 26-50, 51-75, 76-100): one for the Transactional workload and one, with Spark Workers, for Analytics
Cassandra and Spark • Joins and Unions: Cassandra alone: No / Cassandra & Spark: Yes • Transformations: Cassandra alone: Limited / Cassandra & Spark: Yes • Outside Data Integration: Cassandra alone: No / Cassandra & Spark: Yes • Aggregations: Cassandra alone: Limited / Cassandra & Spark: Yes
Summary
Summary Kafka • Topics store information broken into partitions • Brokers store partitions • Partitions are replicated for data resilience Cassandra • The goals of Apache Cassandra are all about staying online and performant • Best for applications close to your users • Partitions are similar data grouped by a partition key Spark • Replacement for Hadoop MapReduce • In memory • More operations than just map and reduce • Makes data analysis easier • Spark Streaming can consume a variety of sources Spark + Cassandra • Cassandra acts as the storage layer for Spark • Deploy in a mixed cluster configuration • Spark executors access Cassandra using the DataStax connector
Lambda Architecture with Spark/Cassandra • Data Sources (Channel, Social, …) feed Data Collection (Messaging) • (Analytical) Batch Data Processing: batch compute over the Raw Data (Reservoir) into a Result Store • (Analytical) Real-Time Data Processing: Stream/Event Processing into a Result Store • A Query Engine over the Result Stores and Computed Information serves Data Access: Reports, Service, Analytic Tools, Alerting Tools
Guido Schmutz Technology Manager guido.schmutz@trivadis.com
