Unlocking the Power of Apache Flink: An Introduction in 4 Acts
The document provides an overview of Apache Flink, emphasizing its capabilities for real-time stream processing and analytics. Key features include support for various programming languages, high performance and scalability, and fault tolerance through checkpointing mechanisms. Flink is geared towards enabling real-time data pipelines and event-driven applications that can efficiently handle both bounded and unbounded streams.
Real-time Data A Sale A Shipment ATrade Rich Front-End Customer Experiences A Customer Experience Real-Time Backend Operations Real-time services rely on stream processing Real-time Stream Processing
4.
Driving business valuewith Apache Flink Real-time analytics Event-driven applications Streaming data pipelines Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed ● Ad/campaign performance ● Content performance ● Quality monitoring of Telco networks ● Usage metering and billing Recognize patterns and react to incoming events by triggering computations, state updates, or external actions ● Fraud detection ● Anomaly detection ● Business process monitoring ● Geo-fencing Real-time data pipelines that continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing) ● Continuous ETL ● Real-time search index building ● ML pipelines ● Data lake ingestion
5.
Developers are choosingFlink because of Its performance and rich feature set Scalability & high performance Flink supports stream processing workloads at tremendous scale Flink supports Java, Python, & SQL, enabling developers to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology Unified processing Flink's checkpointing mechanism provides exactly-once guarantees automatically Fault tolerance & high availability Language flexibility Flink is a top 5 Apache project and has a very active community
● A streamis a sequence of events ● Business data is always a stream: bounded or unbounded ● For Flink, batch processing is just a special case in the runtime now past future bounded stream unbounded stream Streaming
Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
17.
Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
18.
Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
19.
Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
20.
Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events events
21.
Flink’s APIs Apache FlinkRuntime Low-Level Stream Operator API Optimizer / Planner Table / SQL API DataStream API
Flink supports streaming ●Bounded or unbounded streams ● Entire pipeline must always be running ● Input must be processed as it arrives ● Results are reported as they become ready ● Failure recovery resumes from a recent snapshot ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. ● Only bounded streams ● Execution proceeds in stages, running as needed ● Input may be pre-sorted by time and key ● Results are reported at the end of the job ● Failure recovery does a reset and full restart ● Effectively exactly-once guarantees are more straightforward and batch
Stateful stream processingwith Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
27.
Stateful stream processingwith Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
28.
Stateful stream processingwith Flink SQL ● Counting requires state GROUP BY color events results COUNT WHERE color <> orange
Time • Synchronize • Wait •Timeout 09:05:44 When the event was created at its original source. Event time 09:08:01 When the event is being processed. This time varies between applications. Processing time
33.
● Streams are(roughly) ordered by time Out-of-order event streams 10:10 10:14 10:10 10:14
Coping with outof order events Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results?
37.
Watermarks measure progressof event time Watermark ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order
38.
Watermarks measure progressof event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate 1:50 - 5 = 1:45
39.
Watermarks measure progressof event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate ● A watermark is an assertion about the completeness of the stream Now this stream is complete up to 1:45
40.
Watermarks measure progressof event time Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results? It should wait for a watermark with a timestamp of at least 2:00.
The idle stream problem ●Streams that are idle do not advance the watermark ● This prevents windows from producing results
43.
The idle stream problem ●Streams that are idle do not advance the watermark ● This prevents windows from producing results Solutions ● Balance the partitions so none are empty or idle, or ● Send keep-alive events, or ● Configure the watermarking to use idleness detection
44.
Watermarks ● Not neededfor applications that only use wall-clock (processing) time ● Not needed for batch processing ● Are needed for triggering actions based on event-time, e.g., closing a window ● Are generated based on an assumption of how out of order the data might be ● Provide control over the tradeoff between completeness and latency ● Flink SQL drops late events; the DataStream API offers more control ● Allow for consistent, reproducible results ● Potentially idle sources require special attention
A checkpoint isan automatic snapshot created by Flink, primarily for the purpose of failure recovery
47.
A checkpoint isan automatic snapshot created by Flink, primarily for the purpose of failure recovery A savepoint is a manual snapshot created for some operational purpose (e.g., a stateful upgrade)
Snapshots Source Filter Countby color Sink Offsets for some partitions Offsets for other partitions events results COUNT FILTER GROUP BY color
51.
Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Offsets for other partitions ______________________ events results COUNT FILTER GROUP BY color
52.
Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Counters for some colors Offsets for other partitions ______________________ Counters for other colors events results COUNT FILTER GROUP BY color
53.
Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Counters for some colors Transaction ID Offsets for other partitions ______________________ Counters for other colors __________________ events results COUNT FILTER GROUP BY color
54.
Taking a snapshot does NOT stopthe world Checkpoints and savepoints are created asynchronously, while the job continues to process events and produce results
Streaming Unfamiliar to many developers,but ultimately straightforward Watermarks encapsulate something complex in one place – the sources ● how out-of-order? ● can it be idle? Transparent to application developers State snapshots for recovery Delightfully simple ● local ● key/value ● single-threaded State Event time and watermarks
58.
Where is the Flinkcommunity? To subscribe to the mailing lists, or get an invite to Slack, see https://flink.apache.org/community/