LESSONS LEARNED: USING SPARK AND MICROSERVICES (TO EMPOWER DATA SCIENTISTS AND DATA ENGINEERS) Alexis Seigneurin
Who I am • Software engineer for 15+ years • Consultant at Ippon USA, previously at Ippon France • Favorite subjects: Spark, Machine Learning, Cassandra • Spark trainer • @aseigneurin
• 200 software engineers in France, the US and Australia • In the US: offices in DC, NYC and Richmond, Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster, Tatami, etc. • @ipponusa
The project • Analyze records from customers → Give feedback to the customer on their data • High volume of data • 25 million records per day (average) • Need to keep at least 60 days of history = 1.5 billion records • Seasonal peaks... • Need a hybrid platform • Batch processing for some types of analysis • Streaming for other analyses • Hybrid team • Data Scientists: more familiar with Python • Software Engineers: Java
Technical Overview
Processing technology - Spark • Mature platform • Supports batch jobs and streaming jobs • Support for multiple programming languages • Python → Data Scientists • Scala/Java → Software Engineers
Architecture - Real-time platform • New use cases are implemented by Data Scientists all the time • Need the implementations to be independent from each other • One Spark Streaming job per use case • Microservice-inspired architecture • Diamond-shaped • Upstream jobs are written in Scala • Core is made of multiple Python jobs, one per use case • Downstream jobs are written in Scala • Plumbing between the jobs → Kafka 1/2
Architecture - Real-time platform 2/2
Messaging technology - Kafka From kafka.apache.org • “A high-throughput distributed messaging system” • Messaging: between 2 Spark jobs • Distributed: fits well with Spark, can be scaled up or down • High-throughput: handles an average of 300 messages/second, with peaks at 2,000 messages/second • “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log” • Commit log so that you can go back in time and reprocess data • Only used as such when a job crashes, for resilience purposes
Storage • Currently PostgreSQL: • SQL databases are well known by developers and easy to work with • PostgreSQL is available “as-a-service” on AWS • Working on transitioning to Cassandra (more on that later)
Deployment platform • Amazon AWS • Company standard - Everything in the cloud • Easy to scale up or down, ability to choose the hardware • Some limitations • Requirement to use company-crafted AMIs • Cannot use some services (EMR…) • AMIs are renewed every 2 months → need to recreate the platform continuously
Strengths of the platform
Modularity • One Spark job per use case • Hot deployments: can roll out new use cases (= new jobs) without stopping existing jobs • Can roll out updated code without affecting other jobs • Able to measure the resources consumed by a single job • Shared services are provided by upstream and downstream jobs
A/B testing • A/B testing of updated features • Run 2 implementations of the code in parallel • Let each implementation process the data of all the customers • Post-filter so that each customer receives the results of A or B (see the sketch below) • (Measure…) • Can be used to slowly roll out new features
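A minimal sketch, in Scala, of what such a post-filter could look like: each customer is deterministically routed to variant A or B based on a stable hash of their ID. The object name, ratio, and field names are illustrative, not the project's actual code.

object AbPostFilter {
  // Fraction of customers that should receive the results of variant B
  val ratioB = 0.10

  // Deterministic assignment: the same customer always gets the same variant
  def variantFor(customerId: String): String = {
    val bucket = math.abs(customerId.hashCode % 100)
    if (bucket < ratioB * 100) "B" else "A"
  }

  // Downstream, keep a result only if it was produced by the variant
  // assigned to this customer
  def keep(customerId: String, variant: String): Boolean =
    variantFor(customerId) == variant
}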
Data Scientists can contribute • Spark in Python → PySpark • Data Scientists know Python (and don’t want to hear about Java/Scala!) • Business logic implemented in Python • Code is easy to write and to read • Data Scientists are real contributors → quick iterations to production
Challenges
Data Scientist code in production • Shipping code written by Data Scientists is not ideal • Need production-grade code (error handling, logging…) • Code is less tested than Scala code • Harder to deploy than a JAR file → Python Virtual Environments • blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
Allocation of resources in Spark • With Spark Streaming, resources (CPU & memory) are allocated per job • Resources are allocated when the job is submitted and cannot be updated on the fly • Have to allocate 1 core to the Driver of the job → unused resource • Have to allocate extra resources to each job to handle variations in traffic → unused resources • For peak periods, it is easy to add new Spark Workers, but the jobs have to be restarted • Idea to be tested: • Over-allocation of real resources, e.g. let Spark know it has 6 cores on a 4-core server
Micro-batches Spark Streaming processes events in micro-batches • Impact on the latency • Spark Streaming micro-batches → hard to achieve sub-second latency • See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads • Total latency of the system = sum of the latencies of each stage • In this use case, events are independent of each other - no need for windowing computations → a real streaming framework would be more appropriate • Impact on memory usage • Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition • If you start the Spark job with lots of unprocessed data in Kafka, RDD partitions can exceed the size of the memory
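A minimal sketch, in Scala, of the two knobs that matter here with the direct approach: the micro-batch interval (a lower bound on the latency) and the per-partition rate limit that keeps a Kafka backlog from producing oversized RDD partitions. The values are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("use-case-xyz")
  // Cap the records read per Kafka partition per second, so a backlog
  // in Kafka cannot produce RDD partitions larger than the memory
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Optionally let Spark adapt the ingestion rate to the processing rate
  .set("spark.streaming.backpressure.enabled", "true")

// The micro-batch interval is a lower bound on the end-to-end latency
val ssc = new StreamingContext(conf, Seconds(5))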
Resilience of Spark jobs • Spark Streaming application = 1 Driver + 1 Application • Application = N Executors • If an Executor dies → restarted (seamless) • If the Driver dies, the whole Application must be restarted • Scala/Java jobs → “supervised” mode • Python jobs → “supervised” mode not supported with Spark Standalone
Resilience with Spark & Kafka • Connecting Spark to Kafka, 2 methods: • Receiver-based approach: not ideal for parallelism • Direct approach: better for parallelism but you have to deal with Kafka offsets • Dealing with Kafka offsets • Default: consumes from the end of the Kafka topic (or the beginning) • Documentation → use checkpoints • Tasks have to be Serializable (not always possible: dependent libraries) • Harder to deploy the application (classes are serialized) → run a new instance in parallel and kill the first one (harder to automate; messages consumed twice) • Requires a shared file system (HDFS, S3) → high latency on these file systems, which forces you to increase the micro-batch interval 1/2
Resilience with Spark & Kafka • Dealing with Kafka offsets • Solution: deal with offsets in the Spark Streaming application • Write the offsets to reliable storage: ZooKeeper, Kafka… • Write after processing the data • Read the offsets on startup (if no offsets, start from the end) • ippon.tech/blog/spark-kafka-achieving-zero-data-loss/ 2/2
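A sketch of this pattern in Scala with the direct connector (Spark 1.6-era API); readOffsets, saveOffsets and process are hypothetical helpers that would talk to ZooKeeper/Kafka and contain the business logic.

import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// On startup, resume from the stored offsets if there are any,
// otherwise start from the end of the topic
val stream = readOffsets() match {
  case Some(fromOffsets) =>
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message()))
  case None =>
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("INPUT-TOPIC"))
}

stream.foreachRDD { rdd =>
  // Offsets of this micro-batch (only available with the direct approach)
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  process(rdd)               // business logic
  saveOffsets(offsetRanges)  // written only after the data has been processed
}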
Writing to Kafka • Spark Streaming comes with a library to read from Kafka but none to write to Kafka! • Flink or Kafka Streams do that out-of-the-box • Cloudera provides an open-source library: • github.com/cloudera/spark-kafka-writer • (it has since been removed!)
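Without a built-in writer, a common workaround is to open a producer inside each partition of each micro-batch. A sketch in Scala; outputStream (assumed to be a DStream of key/value pairs), the broker address and the topic name are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

outputStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // KafkaProducer is not serializable, so it has to be created on the
    // executors (one per partition/task), not on the driver
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    val producer = new KafkaProducer[String, Array[Byte]](props)

    records.foreach { case (key, value) =>
      producer.send(new ProducerRecord("OUTPUT-TOPIC", key, value))
    }
    producer.close()
  }
}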
Idempotence Spark and fault-tolerance semantics: • Spark can provide an exactly-once guarantee only for the transformation of the data • Writing the data is at-least-once with non-transactional systems (including Kafka in our case) • See spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics → The overall system has to be idempotent
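For example, with the current PostgreSQL storage, making the write an upsert keyed by a natural event ID means that reprocessing the same Kafka messages overwrites rows instead of duplicating them. A minimal sketch in Scala, assuming PostgreSQL 9.5+ (for ON CONFLICT); the table, columns and variables are hypothetical.

import java.sql.DriverManager

// event_id is the primary key: replaying a message rewrites the same row
val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "password")
val stmt = conn.prepareStatement(
  "INSERT INTO analysis_results (event_id, customer_id, score) VALUES (?, ?, ?) " +
  "ON CONFLICT (event_id) DO UPDATE SET score = EXCLUDED.score")
stmt.setString(1, eventId)
stmt.setString(2, customerId)
stmt.setDouble(3, score)
stmt.executeUpdate()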
Message format & schemas • Spark jobs are decoupled, but each depends on the upstream job • Message formats have to be agreed upon • JSON • Pros: flexible • Cons: flexible! (missing fields) • Avro • Pros: enforces a structure (named fields + types) • Cons: hard to propagate the schemas → Confluent’s Schema Registry (more on that later)
Potential & upcoming improvements
Confluent’s Schema Registry docs.confluent.io/3.0.0/schema-registry/docs/index.html • Separate (web) server to manage & enforce Avro schemas • Stores schemas, versions them, and can perform compatibility checks (configurable: backward or forward) • Makes life simpler: ✓ no need to share schemas (“what version of the schema is this?”) ✓ no need to share generated classes ✓ can update the producer with backward-compatible messages without affecting the consumers 1/2
Confluent’s Schema Registry • Comes with: • A Kafka Serializer (for the producer): sends the schema of the object to the Schema Registry before sending the record to Kafka • Message sending fails if schema compatibility fails • A Kafka Decoder (for the consumer): retrieves the schema from the Schema Registry when a message comes in 2/2
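A sketch of the producer-side setup in Scala. The KafkaAvroSerializer class and the schema.registry.url property come from Confluent's documentation; the broker/registry addresses and the record variable are illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
// The Confluent serializer registers/checks the Avro schema against the
// Schema Registry before the record is sent to Kafka
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://localhost:8081")

val producer = new KafkaProducer[AnyRef, AnyRef](props)
// record is an Avro GenericRecord (or a generated SpecificRecord)
producer.send(new ProducerRecord[AnyRef, AnyRef]("OUTPUT-TOPIC", record))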
Kafka Streams docs.confluent.io/3.0.0/streams/index.html • “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka” • Perfect fit for microservices on top of Kafka • Natively consumes messages from Kafka • Natively pushes produced messages to Kafka • Processes messages one at a time → very low latency • Pros • API is very similar to Spark’s API • Deploy new instances of the application to scale out • Cons • JVM languages only - no support for Python • Outside of Spark - one more thing to manage 1/2
Kafka Streams - Example (Java) 2/2

// Configuration of the Streams application (Kafka Streams 0.10 / Confluent 3.0 API)
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2182");
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

// Build the topology: read from Kafka, transform, serialize to Avro, write back to Kafka
KStreamBuilder builder = new KStreamBuilder();

KStream<String, String> kafkaInput = builder.stream("INPUT-TOPIC");
KStream<String, RealtimeXXX> auths = kafkaInput.mapValues(value -> ...);
KStream<String, byte[]> serializedAuths = auths.mapValues(a -> AvroSerializer.serialize(a));

serializedAuths.to(Serdes.String(), Serdes.ByteArray(), "OUTPUT-TOPIC");

// Start processing (messages are processed one at a time)
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
Database migration • The database stores the state • Client settings or analyzed behavior • Historical data (up to 60 days) • Produced outputs • Some technologies can store state (e.g. Samza), but hardly 60 days of data • Initially used PostgreSQL • Easy to start with • Available on AWS “as-a-service”: RDS • Cannot scale to 60 days of historical data, though • Cassandra is a good fit • Scales out for the storage of historical data • Connects to Spark • Loads Cassandra data into Spark, or saves data from Spark to Cassandra • Can be used to reprocess existing data for denormalization purposes
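A sketch in Scala of what this looks like with the spark-cassandra-connector; the keyspace, table, column names and variables are illustrative.

import com.datastax.spark.connector._

// Save results from Spark to Cassandra; writes are upserts, which also
// fits the idempotence requirement discussed earlier
resultsRdd.saveToCassandra("analytics", "customer_results",
  SomeColumns("customer_id", "event_id", "score"))

// Load historical data back into Spark, e.g. to reprocess/denormalize it
val history = sc.cassandraTable("analytics", "customer_results")
  .where("customer_id = ?", customerId)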
Summary & Conclusion
Summary • Is the microservices architecture adequate? • Interesting to separate the implementations of the use cases • Overhead for the other services • Is Spark adequate? • Supports Python (not supported by Kafka Streams) • Micro-batches not adequate
Thank you! @aseigneurin
