Unlocking the Power of Apache Flink: An Introduction in 4 Acts

Unlocking the Power of Apache Flink: An Introduction in 4 Actsin David Anderson Software Practice Lead, Conﬂuent Apache Flink Committer

Today’s consumers expect real-time services

Real-time Data A Sale A Shipment A Trade Rich Front-End Customer Experiences A Customer Experience Real-Time Backend Operations Real-time services rely on stream processing Real-time Stream Processing

Driving business value with Apache Flink Real-time analytics Event-driven applications Streaming data pipelines Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed ● Ad/campaign performance ● Content performance ● Quality monitoring of Telco networks ● Usage metering and billing Recognize patterns and react to incoming events by triggering computations, state updates, or external actions ● Fraud detection ● Anomaly detection ● Business process monitoring ● Geo-fencing Real-time data pipelines that continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing) ● Continuous ETL ● Real-time search index building ● ML pipelines ● Data lake ingestion

Developers are choosing Flink because of Its performance and rich feature set Scalability & high performance Flink supports stream processing workloads at tremendous scale Flink supports Java, Python, & SQL, enabling developers to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology Uniﬁed processing Flink's checkpointing mechanism provides exactly-once guarantees automatically Fault tolerance & high availability Language ﬂexibility Flink is a top 5 Apache project and has a very active community

@yourtwitterhandle | developer.confluent.io Streaming The four cornerstones on which Flink is built State Time Snapshots

● A stream is a sequence of events ● Business data is always a stream: bounded or unbounded ● For Flink, batch processing is just a special case in the runtime now past future bounded stream unbounded stream Streaming

Real-time services rely on stream processing Kafka Databases Key/Value Stores Files Apps Sources Real-time Stream Processing Sinks

Real-time Stream Processing Real-time services rely on stream processing Kafka Databases Key/Value Stores Files Apps Sources Sinks

The Job Graph (or Topology) OPERATOR CONNECTION

Stream processing • Parallel • Forward • Repartition • Rebalance grouped by shape SOURCE

Stream processing • Parallel • Forward • Repartition • Rebalance group by color FILTER

Stream processing • Parallel • Forward • Repartition • Rebalance COUNT 1 2 3 1 2 3 4

Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events

Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange

Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events events

Flink’s APIs Apache Flink Runtime Low-Level Stream Operator API Optimizer / Planner Table / SQL API DataStream API

Flink supports streaming ● Bounded or unbounded streams ● Entire pipeline must always be running ● Input must be processed as it arrives ● Results are reported as they become ready ● Failure recovery resumes from a recent snapshot ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. ● Only bounded streams ● Execution proceeds in stages, running as needed ● Input may be pre-sorted by time and key ● Results are reported at the end of the job ● Failure recovery does a reset and full restart ● Effectively exactly-once guarantees are more straightforward and batch

@yourtwitterhandle | developer.confluent.io Streaming State Time Snapshots

Stateful stream processing with Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange

Stateful stream processing with Flink SQL ● Counting requires state GROUP BY color events results COUNT WHERE color <> orange

State • Local • Fast • Fault tolerant

Time • Synchronize • Wait • Timeout 09:05:44 When the event was created at its original source. Event time 09:08:01 When the event is being processed. This time varies between applications. Processing time

● Streams are (roughly) ordered by time Out-of-order event streams 10:10 10:14 10:10 10:14

Coping with out of order events This event will be read next

Coping with out of order events These events follow

Coping with out of order events Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results?

Watermarks measure progress of event time Watermark ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order

Watermarks measure progress of event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate 1:50 - 5 = 1:45

Watermarks measure progress of event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate ● A watermark is an assertion about the completeness of the stream Now this stream is complete up to 1:45

Watermarks measure progress of event time Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results? It should wait for a watermark with a timestamp of at least 2:00.

What are watermarks for? They make things happen when the time is right.

The idle stream problem ● Streams that are idle do not advance the watermark ● This prevents windows from producing results

The idle stream problem ● Streams that are idle do not advance the watermark ● This prevents windows from producing results Solutions ● Balance the partitions so none are empty or idle, or ● Send keep-alive events, or ● Conﬁgure the watermarking to use idleness detection

Watermarks ● Not needed for applications that only use wall-clock (processing) time ● Not needed for batch processing ● Are needed for triggering actions based on event-time, e.g., closing a window ● Are generated based on an assumption of how out of order the data might be ● Provide control over the tradeoff between completeness and latency ● Flink SQL drops late events; the DataStream API offers more control ● Allow for consistent, reproducible results ● Potentially idle sources require special attention

A checkpoint is an automatic snapshot created by Flink, primarily for the purpose of failure recovery

A checkpoint is an automatic snapshot created by Flink, primarily for the purpose of failure recovery A savepoint is a manual snapshot created for some operational purpose (e.g., a stateful upgrade)

Snapshots events results COUNT FILTER GROUP BY color

Snapshots Source Filter Count by color Sink events results COUNT FILTER GROUP BY color

Snapshots Source Filter Count by color Sink Offsets for some partitions Offsets for other partitions events results COUNT FILTER GROUP BY color

Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Offsets for other partitions ______________________ events results COUNT FILTER GROUP BY color

Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Counters for some colors Offsets for other partitions ______________________ Counters for other colors events results COUNT FILTER GROUP BY color

Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Counters for some colors Transaction ID Offsets for other partitions ______________________ Counters for other colors __________________ events results COUNT FILTER GROUP BY color

Taking a snapshot does NOT stop the world Checkpoints and savepoints are created asynchronously, while the job continues to process events and produce results

Because these are self-consistent, global snapshots ● Flink provides (effectively) exactly-once guarantees ● Recovery involves restarting the entire job from the most recent checkpoint Recovery

Streaming Unfamiliar to many developers, but ultimately straightforward Watermarks encapsulate something complex in one place – the sources ● how out-of-order? ● can it be idle? Transparent to application developers State snapshots for recovery Delightfully simple ● local ● key/value ● single-threaded State Event time and watermarks

Where is the Flink community? To subscribe to the mailing lists, or get an invite to Slack, see https://ﬂink.apache.org/community/

Your Apache Flink® journey begins here developer.conﬂuent.io

Unlocking the Power of Apache Flink: An Introduction in 4 Acts

In this document

More Related Content

What's hot

Similar to Unlocking the Power of Apache Flink: An Introduction in 4 Acts

More from HostedbyConfluent

Recently uploaded

Unlocking the Power of Apache Flink: An Introduction in 4 Acts