Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
Guido Schmutz • Working for Trivadis for more than 18 years • Oracle ACE Director for Fusion Middleware and SOA • Author of different books • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data • Technology Manager @ Trivadis • More than 25 years of software development experience • Contact: guido.schmutz@trivadis.com • Blog: http://guidoschmutz.wordpress.com • Twitter: gschmutz
Agenda 1. Introduction 2. Apache Spark 3. Apache Cassandra 4. Combining Spark & Cassandra 5. Summary
Big Data Definition (4 Vs) • Characteristics of Big Data: its Volume, Velocity and Variety in combination • + Time to action? Big Data + Real-Time = Stream Processing
What is Real-Time Analytics? What is it? Why do we need it? How does it work? • Collect real-time data • Process data as it flows in • Data in Motion over Data at Rest • Reports and dashboards access the processed data • Short time to analyze & respond: required for new business models, desired for competitive advantage
Real Time Analytics Use Cases • Algorithmic Trading • Online Fraud Detection • Geo Fencing • Proximity/Location Tracking • Intrusion detection systems • Traffic Management • Recommendations • Churn detection • Internet of Things (IoT) / Intelligence Sensors • Social Media/Data Analytics • Gaming Data Feed • …
Apache Spark
Motivation – Why Apache Spark? • Hadoop MapReduce shares data between jobs on disk: Input → HDFS read → map → reduce → HDFS write → HDFS read → … → Output • Spark speeds up processing by keeping intermediate data in memory instead of on disk: Input → op1 → op2 → … → Output
Apache Spark Apache Spark is a fast and general engine for large-scale data processing • The hot trend in Big Data! • Originally developed 2009 in UC Berkeley's AMPLab • Based on the 2007 Microsoft Dryad paper • Written in Scala, supports Java, Python, SQL and R • Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • One of the largest OSS communities in big data with over 200 contributors in 50+ organizations • Open sourced in 2010 – since 2014 part of the Apache Software Foundation
Apache Spark • Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing) • Core Runtime: Spark Core API and Execution Model • Cluster Resource Managers: Spark Standalone, MESOS, YARN • Data Stores: HDFS, Elasticsearch, NoSQL, S3
Resilient Distributed Dataset (RDD) Are • Immutable • Re-computable • Fault tolerant • Reusable Have Transformations • Produce a new RDD • Rich set of transformations available • filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ... Have Actions • Start cluster computing operations • Rich set of actions available • collect(), count(), fold(), reduce(), …
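The lazy-transformation / eager-action split above can be sketched in a few lines of plain Python (a toy `MiniRDD`, not the Spark API): transformations only record work on an immutable, re-computable input, and nothing runs until an action is called.

```python
# Toy sketch of RDD semantics -- transformations are lazy, actions execute.
class MiniRDD:
    def __init__(self, data, ops=None):
        self.data = data          # the re-computable input
        self.ops = ops or []      # recorded transformations, not yet run

    def map(self, f):             # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):          # transformation
        return MiniRDD(self.data, self.ops + [("filter", p)])

    def _compute(self):           # runs only when an action is called
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def collect(self):            # action: materializes the result
        return self._compute()

    def count(self):              # action
        return len(self._compute())

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

Because only the input and the operation list are kept, a lost partition of the result can always be recomputed from them, which is the "re-computable, fault tolerant" property above.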
RDD • An RDD is created from an input source: file, database, stream or collection • An action such as .count() triggers computation and returns a value (e.g. 100)
Partitions • An RDD's data is split into partitions (Partition 0 to Partition 9) that are distributed across the servers of the cluster (Server 1 to Server 5)
Partitions • If a server is lost (e.g. Server 1), its partitions are re-computed and redistributed across the remaining servers (Server 2 to Server 5)
Spark Workflow • sc.hadoopFile() reads an HDFS input file into a HadoopRDD • Stage 1 – flatMap() + map() produce MappedRDDs (partitions P0, P1, P3) • Stage 2 – reduceByKey() produces a ShuffledRDD (P0) • Transformations are lazy; the action saveAsTextFile() executes them and writes the text file output • The Master's DAG Scheduler builds the stages
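The word-count workflow above, reduced to plain Python (the HDFS read is replaced by an in-memory list; this only illustrates the per-stage logic, not Spark's distributed execution):

```python
from collections import defaultdict

lines = ["to be or not to be", "to do or not to do"]  # stands in for the HDFS input

# Stage 1 - flatMap() + map(): split lines into words, emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Stage 2 - reduceByKey(): sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'do': 2}
```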
Spark Workflow • Two HDFS input files are read with SparkContext.hadoopFile() into HadoopRDDs • Input 1 passes through filter() and map() (FilteredRDD, MappedRDD); Input 2 passes through map() (MappedRDD) • join() combines them into a ShuffledRDD • Transformations are lazy; the action SparkContext.saveAsHadoopFile() executes them and writes the HDFS output file
Spark Execution Model • A Master coordinates Workers • A Worker runs on a server, next to the data storage, and starts Executors
Spark Execution Model – Narrow Transformations • Stage 1 – flatMap() + map(): narrow transformations such as filter(), map(), sample() and flatMap() operate on each RDD partition (P0, P1, P3) locally on its worker, without moving data
Spark Execution Model – Wide Transformations • Stage 2 – reduceByKey(): wide transformations such as join(), reduceByKey(), union() and groupByKey() need data from other partitions and therefore cause a shuffle across workers
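Why reduceByKey() forces a shuffle can be shown with a small plain-Python sketch (hypothetical data): pairs with the same key start out in different partitions, so each partition first routes every pair by hash of its key before the reduce can run locally.

```python
# Three input partitions holding (key, count) pairs, as on different workers
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1), ("a", 1)]]
num_out = 2  # number of post-shuffle partitions

# Shuffle: every input partition sends each pair to hash(key) % num_out,
# so all pairs with the same key end up in the same output partition
shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for key, value in part:
        shuffled[hash(key) % num_out].append((key, value))

# Reduce: after the shuffle, each key can be summed locally in one partition
result = {}
for part in shuffled:
    for key, value in part:
        result[key] = result.get(key, 0) + value

print(result)  # counts per key, e.g. {'a': 3, 'b': 2, 'c': 1}
```

The routing step is exactly the cross-worker data movement ("Shuffle!") that makes wide transformations expensive compared to narrow ones.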
Batch vs. Real-Time Processing • Batch: petabytes of data • Real-time: gigabytes per second
Various Input Sources
Apache Kafka • Distributed publish-subscribe messaging system • Designed for processing of real-time activity stream data (logs, metrics collection, social media streams, …) • Initially developed at LinkedIn, now part of Apache • Does not use the JMS API and standards • Kafka maintains feeds of messages in topics • Producers write to and consumers read from a Kafka cluster
Apache Kafka • A Weather Station producer writes messages (1–6) to a Temperature Topic and a Rainfall Topic on a Kafka Broker; a Temperature Processor and a Rainfall Processor consume them
Apache Kafka • A topic can be split into partitions (here the Temperature Topic has Partition 0 and Partition 1) • Each partition is an ordered sequence of messages (1–6), and one Temperature Processor instance can consume each partition in parallel
Apache Kafka • Topic partitions (Temperature Topic P0/P1, Rainfall Topic P0) are replicated across multiple Kafka Brokers, so processing continues if a broker fails
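A minimal plain-Python sketch of the topic/partition model (a hypothetical `MiniTopic`, not the Kafka API): producers append to a partition chosen by key, and every consumer reads independently from its own offset.

```python
class MiniTopic:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, message):
        # Messages with the same key land in the same partition,
        # which preserves per-key ordering
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p

    def consume(self, partition, offset):
        # Consumers pull by (partition, offset); reading removes nothing,
        # so many independent consumers can read the same messages
        return self.partitions[partition][offset:]

temperature = MiniTopic(num_partitions=2)
p = temperature.produce("station-1", "21.5 C")
temperature.produce("station-1", "21.7 C")

print(temperature.consume(p, 0))  # ['21.5 C', '21.7 C']
print(temperature.consume(p, 1))  # ['21.7 C'] - a second, independent reader
```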
Discretized Stream (DStream) Kafka Weather Station Weather Station Weather Station
Discretized Stream (DStream) • The stream of individual events from Kafka (Weather Stations) is made discrete by time: each time slice of events becomes one RDD, and the sequence of these RDDs is the DStream
Discretized Stream (DStream) • Transformations such as .map(), .join(), .reduceByKey() and .countByValue() are applied to each X-seconds batch of a DStream and produce a new DStream
Discretized Stream (DStream) • At each batch interval (time 1, time 2, time 3, … time n) the messages of the Input Stream form one RDD of the Event DStream • A transformation such as map() applies f(message) to every message of that batch, yielding the corresponding RDD of the MappedDStream • An action such as saveAsHadoopFiles() triggers a Spark job per batch • The transformation lineage applies per batch, with time increasing from left to right (Adapted from Chris Fregly: http://slidesha.re/11PP7FV)
Apache Spark Streaming – Core concepts Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDD Input DStreams • Represent the stream of raw data received from streaming sources • Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc. • Custom sources can easily be written for custom data sources Operations • Same as Spark Core + additional stateful transformations (window(), reduceByWindow())
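The micro-batch idea can be sketched in plain Python (hypothetical event data): cut timestamped events into fixed intervals, and run the same batch-style operation, here a countByValue-style count, on every interval.

```python
from collections import Counter

BATCH_SECONDS = 10

# (timestamp, value) events as they might arrive from Kafka (made-up data)
events = [(0, "rain"), (3, "sun"), (9, "rain"), (12, "sun"), (14, "sun"), (27, "rain")]

# Discretize by time: one micro-batch ("RDD") per BATCH_SECONDS interval
batches = {}
for ts, value in events:
    batches.setdefault(ts // BATCH_SECONDS, []).append(value)

# Apply the same batch operation to every micro-batch, like .countByValue()
counts_per_batch = {t: dict(Counter(batch)) for t, batch in sorted(batches.items())}
print(counts_per_batch)
# {0: {'rain': 2, 'sun': 1}, 1: {'sun': 2}, 2: {'rain': 1}}
```

This is why DStream operations look like RDD operations: each interval really is an ordinary batch, just a small and frequent one.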
Apache Cassandra
Apache Cassandra Apache Cassandra™ is a free • Distributed… • High performance… • Extremely scalable… • Fault tolerant (i.e. no single point of failure)… post-relational database solution Optimized for high write throughput
Apache Cassandra - History • Influenced by Google Bigtable and Amazon Dynamo
Motivation - Why NoSQL Databases? • Dynamo Paper (2007) • How to build a data store that is • Reliable • Performant • “Always On” • Nothing new and shiny • 24 other papers cited • Evolutionary
Motivation - Why NoSQL Databases? • Google Bigtable (2006) • Richer data model • 1 key and lots of values • Fast sequential access • 38 other papers cited
Motivation - Why NoSQL Databases? • Cassandra Paper (2008) • Distributed features of Dynamo • Data Model and storage from BigTable • February 2010 graduated to a top-level Apache Project
Apache Cassandra – More than one server • All nodes participate in a cluster • Shared nothing • Add or remove as needed • More capacity? Add more servers • A node is the basic unit inside a cluster • Each node owns a range of partitions, assigned by consistent hashing: Node 1 [0-25], Node 2 [26-50], Node 3 [51-75], Node 4 [76-100] • Each range is also replicated to other nodes
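The consistent-hashing idea can be sketched in plain Python with the token ranges from the slide (Cassandra really uses e.g. the Murmur3 partitioner; the md5-based token and replication factor 3 here are assumptions for illustration):

```python
import hashlib

# Token ranges from the slide: each node owns one range of the 0-100 ring
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]

def token(partition_key):
    # Stable hash onto 0..100 (a stand-in for the real partitioner)
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 101

def replicas(partition_key, rf=3):
    t = token(partition_key)
    # Find the node whose range covers the token...
    start = next(i for i, (upper, _) in enumerate(RING) if t <= upper)
    # ...then take the owner plus the next rf-1 nodes walking around the ring
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(token("station-42"), replicas("station-42"))
```

Because the key alone determines the owning node, any client can route a read or write without a central lookup, and adding a node only moves the ranges adjacent to it.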
Apache Cassandra – Fully Replicated • A client writes to its local data center (e.g. West, Nodes 1–4) • Data syncs across the WAN to the other data center (East, Nodes 1–4) • Replication is configured per data center
Apache Cassandra What is Cassandra NOT? • A Data Ocean • A Data Lake • A Data Pond • An In-Memory Database • A Key-Value Store • Not for Data Warehousing What are good use cases? • Product Catalog / Playlists • Personalization (Ads, Recommendations) • Fraud Detection • Time Series (Finance, Smart Meter) • IoT / Sensor Data • Graph / Network data
How Cassandra stores data • Model brought from Google Bigtable • A row key and a lot of columns: up to 2 billion columns per row, billions of rows • Column names are sorted (UTF8, Int, Timestamp, etc.) • Each column holds a column name, a column value, a timestamp and a TTL
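A plain-Python sketch of this wide-row model (a hypothetical `MiniWideRowStore`, not Cassandra's storage engine): one row key maps to many columns kept sorted by name, each cell carrying value, timestamp and TTL, so slicing a column range within a row is cheap.

```python
import bisect
import time

class MiniWideRowStore:
    def __init__(self):
        self.rows = {}  # row key -> (sorted column names, {name: (value, ts, ttl)})

    def put(self, row_key, column, value, ttl=None):
        names, cells = self.rows.setdefault(row_key, ([], {}))
        if column not in cells:
            bisect.insort(names, column)           # keep column names sorted
        cells[column] = (value, time.time(), ttl)  # cell = value + timestamp + TTL

    def slice(self, row_key, start, end):
        # Sorted column names make range reads within one row a binary search
        names, cells = self.rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return [(n, cells[n][0]) for n in names[lo:hi]]

store = MiniWideRowStore()
for hour in ("08:00", "09:00", "10:00", "11:00"):
    store.put("station-1:2015-06-01", hour, f"{20 + int(hour[:2]) / 10} C")

print(store.slice("station-1:2015-06-01", "09:00", "10:00"))
# [('09:00', '20.9 C'), ('10:00', '21.0 C')]
```

This is also why Cassandra suits time series: with a timestamp as the column name, one row holds a whole series in order.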
Combining Spark & Cassandra
Spark and Cassandra Architecture – Great Combo • Spark: good at analyzing a huge amount of data • Cassandra: good at storing a huge amount of data
Spark and Cassandra Architecture • Spark Streaming (Near Real-Time) • SparkSQL (Structured Data) • MLlib (Machine Learning) • GraphX (Graph Analysis)
Spark and Cassandra Architecture • Weather Stations stream data into Spark Streaming (Near Real-Time); SparkSQL (Structured Data), MLlib (Machine Learning) and GraphX (Graph Analysis) run on top • The Spark Connector links the Spark stack to Cassandra
Spark and Cassandra Architecture • Single node running Cassandra • The Spark Worker is really small • The Spark Master lives outside the node • The Spark Worker starts Spark Executors in separate JVMs • Node local
Spark and Cassandra Architecture • Each node runs Spark and Cassandra • The Spark Master can make decisions based on token ranges (0-25, 26-50, 51-75, 76-100) • Spark likes to work on small partitions of data across a large cluster • Cassandra likes to spread out data in a large cluster • Each worker will only have to analyze the 25% of the data that is local to its node!
Spark and Cassandra Architecture • Run two rings with the same token ranges (0-25, 26-50, 51-75, 76-100): one for the Transactional workload and one, with Spark Workers, for Analytics
Cassandra and Spark • Joins and Unions: Cassandra alone: No / Cassandra & Spark: Yes • Transformations: Cassandra alone: Limited / Cassandra & Spark: Yes • Outside Data Integration: Cassandra alone: No / Cassandra & Spark: Yes • Aggregations: Cassandra alone: Limited / Cassandra & Spark: Yes
Summary
Summary Kafka • Topics store information broken into partitions • Brokers store partitions • Partitions are replicated for data resilience Cassandra • The goals of Apache Cassandra are all about staying online and performant • Best for applications close to your users • Partitions are similar data grouped by a partition key Spark • Replacement for Hadoop MapReduce • In memory • More operations than just map and reduce • Makes data analysis easier • Spark Streaming can consume a variety of sources Spark + Cassandra • Cassandra acts as the storage layer for Spark • Deploy in a mixed cluster configuration • Spark executors access Cassandra using the DataStax connector
Lambda Architecture with Spark/Cassandra • Data Sources (Channel, Social, …) feed Data Collection (Messaging) • (Analytical) Batch Data Processing: batch compute over the Raw Data (Reservoir) into a Result Store • (Analytical) Real-Time Data Processing: Stream/Event Processing into a Result Store • A Query Engine over the Result Stores and Computed Information serves Data Access: Reports, Service, Analytic Tools, Alerting Tools
Guido Schmutz Technology Manager guido.schmutz@trivadis.com
