Unified Stream and Batch Processing with Apache Flink
The document discusses Apache Flink, an open-source stream processing framework that provides high-throughput, low-latency processing of both streaming and batch data. Flink offers explicit handling of event time and stateful stream processing with exactly-once semantics. It also supports features such as windowing, sessionization, and complex event processing that are useful for building streaming applications.
The presentation, given by Ufuk Celebi at Hadoop Summit Dublin on April 13, 2016, introduces Unified Stream & Batch Processing with Apache Flink.
Apache Flink is an open-source stream processing framework notable for event time handling, state and fault tolerance, low latency, and high throughput.
Example of a basic counting application that tracks visitors and ad impressions, a simple problem that generalizes to many streaming use cases.
Batch processing overview using various platforms like Hadoop, Spark, and Flink, illustrated by counting events in a dataset.
Concept of continuous counting with periodic batch jobs to manage incoming data over defined time intervals.
Explains complexities in batch processing including data loading, job scheduling, and using processors like Hadoop, Spark, and Flink.
High latency issues in batch jobs discussed, highlighting the delay from event processing to serving layer.
Treatment of time in batch jobs illustrated through examples, emphasizing the implicit nature of time within files.
Shift towards dedicated stream processing, reducing reliance on batch processors and avoiding lambda architectures, enhanced by modern solutions like Flink.
Key components of stream processing in Flink, such as message queues and stream processors that ensure durability and consistent processing.
Building blocks of Flink: explicit handling of time, state and fault tolerance, and performance.
Different windowing techniques in Flink, including tumbling and sliding windows for time-based and data-based aggregations.
Flink's approach to explicit time handling, including session windows that close after periods of inactivity.
Differentiates between event time and processing time in data streams and the handling of out-of-order events.
Explains processing semantics in Flink emphasizing 'at-least once', 'exactly once', and their implications for state consistency.
Yahoo! benchmark results comparing Flink, Storm, and Spark Streaming focused on end-to-end latency and throughput.
Suggestions on extending existing benchmarks to account for high write throughput and Flink's fault tolerance features.
Highlights advantages of Flink in stream processing for continuous data applications and the importance of framework choice.
Describes Flink libraries for complex event processing and various APIs for stream and batch processing.
Planned features for Flink including SQL support, dynamic scaling, and queryable state for improved resource management.
Details on dynamic scaling in Flink jobs and the capability to query data directly within the Flink processing environment.
Encourages collaboration and engagement with the Flink community through blogs and social media.
What is Apache Flink?
Apache Flink is an open source stream processing framework.
• Event Time Handling
• State & Fault Tolerance
• Low Latency
• High Throughput
Developed at the Apache Software Foundation.
Recent History
April '14: project enters Apache incubation. December '14: becomes a top-level Apache project. Releases v0.5 through v0.10 follow, leading to Release 1.0 in March '16.
Flink Stack
The DataStream API (stream processing) and the DataSet API (batch processing), with libraries on top of both, all running on a common runtime (distributed streaming data flow). Streaming and batch as first class citizens.
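To illustrate "streaming and batch as first class citizens", here is a minimal sketch (not from the slides) that expresses the same grouped count once with the batch DataSet API and once with the streaming DataStream API. The (adId, 1) tuples and the class name CountBothWays are made up for the example; the calls shown are from the Flink 1.x Java APIs.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CountBothWays {
    public static void main(String[] args) throws Exception {
        // Batch: DataSet API
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> batchCounts = batchEnv
                .fromElements(Tuple2.of("ad-1", 1), Tuple2.of("ad-2", 1), Tuple2.of("ad-1", 1))
                .groupBy(0)   // group by ad id
                .sum(1);      // count per ad
        batchCounts.print(); // print() triggers execution for the batch job

        // Streaming: DataStream API (same logic; the input would be unbounded in practice)
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv
                .fromElements(Tuple2.of("ad-1", 1), Tuple2.of("ad-2", 1), Tuple2.of("ad-1", 1))
                .keyBy(0)     // key by ad id
                .sum(1)       // running count per ad
                .print();
        streamEnv.execute("Streaming count");
    }
}

Both programs run on the same underlying streaming runtime, which is the point of the stack slide above.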
Streaming
Until now, stream processors were less mature than their batch counterparts. This led to:
• in-house solutions,
• abuse of batch processors,
• Lambda architectures.
This is no longer needed with new-generation stream processors like Flink.
Tumbling Windows (No Overlap)
e.g. “Count over the last 5 minutes”, “Average over the last 100 records”
Sliding Windows (With Overlap)
e.g. “Count over the last 5 minutes, updated each minute”, “Average over the last 100 elements, updated every 10 elements”
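The window types described on the slides above can be sketched in code. The following is an illustrative Flink 1.x DataStream example, not from the talk: the (adId, 1) demo tuples and the class name WindowSketch are made up, and the session window assigner assumes a Flink release that ships ProcessingTimeSessionWindows (1.1 or later). With such a tiny bounded demo input the long-running windows will not fire before the job ends; in a real job the source is unbounded.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> events = env.fromElements(
                Tuple2.of("ad-1", 1), Tuple2.of("ad-2", 1), Tuple2.of("ad-1", 1));

        // Sliding window: "count over the last 5 minutes, updated each minute".
        events.keyBy(0)
              .timeWindow(Time.minutes(5), Time.minutes(1))
              .sum(1)
              .print();

        // Count window: sum over the last 100 records, updated every 10 records.
        events.keyBy(0)
              .countWindow(100, 10)
              .sum(1)
              .print();

        // Session window: a session closes after 15 minutes of inactivity per key.
        events.keyBy(0)
              .window(ProcessingTimeSessionWindows.withGap(Time.minutes(15)))
              .sum(1)
              .print();

        env.execute("Window sketch");
    }
}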
Explicit Handling of Time

DataStream<ColorEvent> counts = env
    .addSource(new KafkaConsumer(…))   // read events from Kafka
    .keyBy("color")                    // partition the stream by color
    .timeWindow(Time.minutes(60))      // 60-minute time windows
    .apply(new CountPerWindow());      // count events per window

Time is explicit in your program.
Notions of Time
• Event time: the time when the event happened (e.g. 12:23 am).
• Processing time: the time measured by the system clock when the event is processed (e.g. 1:37 pm).
Out of Order Events
Star Wars example: by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the episodes arrive as IV, V, VI, I, II, III, VII; by event time (story order) they are Episodes I through VII. Events can arrive out of order with respect to event time.
Out of Order Events
Figure: two bursts of events are grouped differently by event time windows (which assign each event according to when it happened) than by processing time windows (which assign each event according to when it arrives).
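A sketch of how event time and out-of-order events are handled in code. This is illustrative only: it assumes a later Flink 1.x DataStream API with BoundedOutOfOrdernessTimestampExtractor, and the colors, timestamps, and class name EventTimeSketch are invented. Each event carries its own timestamp, and watermarks tell Flink how much out-of-orderness to tolerate before closing an event time window.

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); // window by event time, not the system clock

        // (color, eventTimestampMillis, count) -- the last element arrives out of order.
        DataStream<Tuple3<String, Long, Integer>> events = env.fromElements(
                Tuple3.of("blue", 1_460_545_200_000L, 1),
                Tuple3.of("red",  1_460_545_260_000L, 1),
                Tuple3.of("blue", 1_460_545_140_000L, 1));

        events
            // Use the timestamp embedded in each event and emit watermarks that
            // tolerate events arriving up to one minute late.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Integer>>(Time.minutes(1)) {
                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Integer> event) {
                        return event.f1;
                    }
                })
            .keyBy(0)                      // per color
            .timeWindow(Time.minutes(60))  // 60-minute event time windows
            .sum(2)                        // count per color and window
            .print();

        env.execute("Event time sketch");
    }
}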
Processing Semantics
• At-least once: may over-count after a failure.
• Exactly once: correct counts after failures.
• End-to-end exactly once: correct counts in external systems (e.g. DB, file system) after a failure.
Processing Semantics
• Flink guarantees exactly once (can be configured for at-least once if desired)
• End-to-end exactly once with specific sources and sinks (e.g. Kafka -> Flink -> HDFS)
• Internally, Flink periodically takes consistent snapshots of the state without ever stopping computation
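The snapshotting described above is enabled per job. A minimal configuration sketch, assuming a Flink 1.x StreamExecutionEnvironment; the 10-second interval and the trivial pipeline are placeholders, not values from the talk.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 10 seconds,
        // without stopping the computation.
        env.enableCheckpointing(10_000);

        // Exactly once is the default; at-least once can be chosen instead
        // if lower checkpointing overhead matters more than strictly correct counts.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Trivial stand-in pipeline; end-to-end exactly once additionally needs
        // a replayable source (e.g. Kafka) and a transactional or idempotent sink.
        env.fromElements(1, 2, 3).print();

        env.execute("Checkpointing sketch");
    }
}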
Building Blocks of Flink
• Explicit Handling of Time
• State & Fault Tolerance
• Performance
Yahoo! Benchmark
• Storm 0.10, Spark Streaming 1.5, and Flink 0.10 benchmark by the Storm team at Yahoo!
• Focus on measuring end-to-end latency at low throughputs (~200k events/sec)
• First benchmark modelled after a real application
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over the last 10 seconds
• Make aggregates available for queries in Redis
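The core of that job can be sketched with the same windowed count pattern as earlier. This is not the benchmark's actual code: the Kafka source and Redis sink are elided, and the campaign IDs and the class name CampaignCountSketch are invented. With this tiny bounded demo input the 10-second window may not fire before the job finishes; in the benchmark the events stream in continuously.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CampaignCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (campaignId, 1) per ad impression; in the benchmark these come from Kafka.
        env.fromElements(
                Tuple2.of("campaign-1", 1),
                Tuple2.of("campaign-2", 1),
                Tuple2.of("campaign-1", 1))
           .keyBy(0)                      // group by campaign
           .timeWindow(Time.seconds(10))  // aggregate over the last 10 seconds
           .sum(1)                        // impressions per campaign and window
           .print();                      // the benchmark writes the result to Redis instead

        env.execute("Campaign count sketch");
    }
}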
Latency (Lower is Better)
Figure: 99th percentile latency (sec) vs. throughput (1,000 events/sec, 60 to 180) for Storm 0.10, Flink 0.10, and Spark Streaming 1.5. Spark's latency increases with throughput; Storm and Flink stay at low latencies.
Extending the Benchmark
• Great starting point, but the benchmark stops at low write throughput and the programs are not fault-tolerant
• Extend the benchmark to high volumes and Flink's built-in fault-tolerant state
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Throughput (Higher is Better)
Figure: maximum throughput (events/sec, axis up to 15,000,000) for Storm w/ Kafka, Flink w/ Kafka, and Flink w/o Kafka. Flink w/ Kafka is limited by the bandwidth between the Kafka and Flink clusters.
Summary
• Stream processing is gaining momentum; it is the right paradigm for continuous data applications
• Choice of framework is crucial: even seemingly simple applications become complex at scale and in production
• Flink offers a unique combination of efficiency, consistency, and event time handling
Upcoming Features
• SQL: ongoing work in collaboration with Apache Calcite
• Dynamic Scaling: adapt resources to stream volume, scale up for historical stream processing
• Queryable State: query the state inside the stream processor