Unlocking the Power of Apache Flink: An Introduction in 4 Actsin David Anderson Software Practice Lead, Confluent Apache Flink Committer
Today’s consumers expect real-time services
Real-time Data A Sale A Shipment A Trade Rich Front-End Customer Experiences A Customer Experience Real-Time Backend Operations Real-time services rely on stream processing Real-time Stream Processing
Driving business value with Apache Flink Real-time analytics Event-driven applications Streaming data pipelines Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed ● Ad/campaign performance ● Content performance ● Quality monitoring of Telco networks ● Usage metering and billing Recognize patterns and react to incoming events by triggering computations, state updates, or external actions ● Fraud detection ● Anomaly detection ● Business process monitoring ● Geo-fencing Real-time data pipelines that continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing) ● Continuous ETL ● Real-time search index building ● ML pipelines ● Data lake ingestion
Developers are choosing Flink because of Its performance and rich feature set Scalability & high performance Flink supports stream processing workloads at tremendous scale Flink supports Java, Python, & SQL, enabling developers to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology Unified processing Flink's checkpointing mechanism provides exactly-once guarantees automatically Fault tolerance & high availability Language flexibility Flink is a top 5 Apache project and has a very active community
@yourtwitterhandle | developer.confluent.io Streaming The four cornerstones on which Flink is built State Time Snapshots
● A stream is a sequence of events ● Business data is always a stream: bounded or unbounded ● For Flink, batch processing is just a special case in the runtime now past future bounded stream unbounded stream Streaming
Real-time services rely on stream processing Kafka Databases Key/Value Stores Files Apps Sources Real-time Stream Processing Sinks
Real-time Stream Processing Real-time services rely on stream processing Kafka Databases Key/Value Stores Files Apps Sources Sinks
The Job Graph (or Topology)
The Job Graph (or Topology) OPERATOR CONNECTION
Stream processing • Parallel • Forward • Repartition • Rebalance grouped by shape SOURCE
Stream processing • Parallel • Forward • Repartition • Rebalance grouped by shape SOURCE
Stream processing • Parallel • Forward • Repartition • Rebalance group by color FILTER
Stream processing • Parallel • Forward • Repartition • Rebalance COUNT 1 2 3 1 2 3 4
Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
Stream processing with SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events events
Flink’s APIs Apache Flink Runtime Low-Level Stream Operator API Optimizer / Planner Table / SQL API DataStream API
Runtime Architecture
Runtime Architecture
Flink supports streaming ● Bounded or unbounded streams ● Entire pipeline must always be running ● Input must be processed as it arrives ● Results are reported as they become ready ● Failure recovery resumes from a recent snapshot ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. ● Only bounded streams ● Execution proceeds in stages, running as needed ● Input may be pre-sorted by time and key ● Results are reported at the end of the job ● Failure recovery does a reset and full restart ● Effectively exactly-once guarantees are more straightforward and batch
@yourtwitterhandle | developer.confluent.io Streaming State Time Snapshots
Stateful stream processing with Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
Stateful stream processing with Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
Stateful stream processing with Flink SQL ● Counting requires state GROUP BY color events results COUNT WHERE color <> orange
State • Local • Fast • Fault tolerant
State • Local • Fast • Fault tolerant
@yourtwitterhandle | developer.confluent.io Streaming State Time Snapshots
Time • Synchronize • Wait • Timeout 09:05:44 When the event was created at its original source. Event time 09:08:01 When the event is being processed. This time varies between applications. Processing time
● Streams are (roughly) ordered by time Out-of-order event streams 10:10 10:14 10:10 10:14
Coping with out of order events This event will be read next
Coping with out of order events These events follow
Coping with out of order events Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results?
Watermarks measure progress of event time Watermark ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order
Watermarks measure progress of event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate 1:50 - 5 = 1:45
Watermarks measure progress of event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate ● A watermark is an assertion about the completeness of the stream Now this stream is complete up to 1:45
Watermarks measure progress of event time Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results? It should wait for a watermark with a timestamp of at least 2:00.
What are watermarks for? They make things happen when the time is right.
The idle stream problem ● Streams that are idle do not advance the watermark ● This prevents windows from producing results
The idle stream problem ● Streams that are idle do not advance the watermark ● This prevents windows from producing results Solutions ● Balance the partitions so none are empty or idle, or ● Send keep-alive events, or ● Configure the watermarking to use idleness detection
Watermarks ● Not needed for applications that only use wall-clock (processing) time ● Not needed for batch processing ● Are needed for triggering actions based on event-time, e.g., closing a window ● Are generated based on an assumption of how out of order the data might be ● Provide control over the tradeoff between completeness and latency ● Flink SQL drops late events; the DataStream API offers more control ● Allow for consistent, reproducible results ● Potentially idle sources require special attention
@yourtwitterhandle | developer.confluent.io Streaming State Time Snapshots
A checkpoint is an automatic snapshot created by Flink, primarily for the purpose of failure recovery
A checkpoint is an automatic snapshot created by Flink, primarily for the purpose of failure recovery A savepoint is a manual snapshot created for some operational purpose (e.g., a stateful upgrade)
Snapshots events results COUNT FILTER GROUP BY color
Snapshots Source Filter Count by color Sink events results COUNT FILTER GROUP BY color
Snapshots Source Filter Count by color Sink Offsets for some partitions Offsets for other partitions events results COUNT FILTER GROUP BY color
Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Offsets for other partitions ______________________ events results COUNT FILTER GROUP BY color
Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Counters for some colors Offsets for other partitions ______________________ Counters for other colors events results COUNT FILTER GROUP BY color
Snapshots Source Filter Count by color Sink Offsets for some partitions ______________________ Counters for some colors Transaction ID Offsets for other partitions ______________________ Counters for other colors __________________ events results COUNT FILTER GROUP BY color
Taking a snapshot does NOT stop the world Checkpoints and savepoints are created asynchronously, while the job continues to process events and produce results
Because these are self-consistent, global snapshots ● Flink provides (effectively) exactly-once guarantees ● Recovery involves restarting the entire job from the most recent checkpoint Recovery
Wrap-up
Streaming Unfamiliar to many developers, but ultimately straightforward Watermarks encapsulate something complex in one place – the sources ● how out-of-order? ● can it be idle? Transparent to application developers State snapshots for recovery Delightfully simple ● local ● key/value ● single-threaded State Event time and watermarks
Where is the Flink community? To subscribe to the mailing lists, or get an invite to Slack, see https://flink.apache.org/community/
Your Apache Flink® journey begins here developer.confluent.io

Unlocking the Power of Apache Flink: An Introduction in 4 Acts

  • 1.
    Unlocking the Powerof Apache Flink: An Introduction in 4 Actsin David Anderson Software Practice Lead, Confluent Apache Flink Committer
  • 2.
    Today’s consumers expectreal-time services
  • 3.
    Real-time Data A Sale A Shipment ATrade Rich Front-End Customer Experiences A Customer Experience Real-Time Backend Operations Real-time services rely on stream processing Real-time Stream Processing
  • 4.
    Driving business valuewith Apache Flink Real-time analytics Event-driven applications Streaming data pipelines Continuously produce and update results which are displayed and delivered to users as real-time data streams are consumed ● Ad/campaign performance ● Content performance ● Quality monitoring of Telco networks ● Usage metering and billing Recognize patterns and react to incoming events by triggering computations, state updates, or external actions ● Fraud detection ● Anomaly detection ● Business process monitoring ● Geo-fencing Real-time data pipelines that continuously ingest, enrich, and transform data streams, loading them into destination systems for timely action (vs. batch processing) ● Continuous ETL ● Real-time search index building ● ML pipelines ● Data lake ingestion
  • 5.
    Developers are choosingFlink because of Its performance and rich feature set Scalability & high performance Flink supports stream processing workloads at tremendous scale Flink supports Java, Python, & SQL, enabling developers to work in their language of choice Flink supports stream processing, batch processing, and ad-hoc analytics through one technology Unified processing Flink's checkpointing mechanism provides exactly-once guarantees automatically Fault tolerance & high availability Language flexibility Flink is a top 5 Apache project and has a very active community
  • 6.
    @yourtwitterhandle | developer.confluent.io Streaming Thefour cornerstones on which Flink is built State Time Snapshots
  • 7.
    ● A streamis a sequence of events ● Business data is always a stream: bounded or unbounded ● For Flink, batch processing is just a special case in the runtime now past future bounded stream unbounded stream Streaming
  • 8.
    Real-time services relyon stream processing Kafka Databases Key/Value Stores Files Apps Sources Real-time Stream Processing Sinks
  • 9.
    Real-time Stream Processing Real-timeservices rely on stream processing Kafka Databases Key/Value Stores Files Apps Sources Sinks
  • 10.
    The Job Graph(or Topology)
  • 11.
    The Job Graph(or Topology) OPERATOR CONNECTION
  • 12.
    Stream processing • Parallel •Forward • Repartition • Rebalance grouped by shape SOURCE
  • 13.
    Stream processing • Parallel •Forward • Repartition • Rebalance grouped by shape SOURCE
  • 14.
    Stream processing • Parallel •Forward • Repartition • Rebalance group by color FILTER
  • 15.
    Stream processing • Parallel •Forward • Repartition • Rebalance COUNT 1 2 3 1 2 3 4
  • 16.
    Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
  • 17.
    Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
  • 18.
    Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
  • 19.
    Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events
  • 20.
    Stream processing withSQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color results COUNT WHERE color <> orange events events
  • 21.
    Flink’s APIs Apache FlinkRuntime Low-Level Stream Operator API Optimizer / Planner Table / SQL API DataStream API
  • 22.
  • 23.
  • 24.
    Flink supports streaming ●Bounded or unbounded streams ● Entire pipeline must always be running ● Input must be processed as it arrives ● Results are reported as they become ready ● Failure recovery resumes from a recent snapshot ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. ● Only bounded streams ● Execution proceeds in stages, running as needed ● Input may be pre-sorted by time and key ● Results are reported at the end of the job ● Failure recovery does a reset and full restart ● Effectively exactly-once guarantees are more straightforward and batch
  • 25.
  • 26.
    Stateful stream processingwith Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
  • 27.
    Stateful stream processingwith Flink SQL INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color <> orange GROUP BY color; GROUP BY color events results COUNT WHERE color <> orange
  • 28.
    Stateful stream processingwith Flink SQL ● Counting requires state GROUP BY color events results COUNT WHERE color <> orange
  • 29.
  • 30.
  • 31.
  • 32.
    Time • Synchronize • Wait •Timeout 09:05:44 When the event was created at its original source. Event time 09:08:01 When the event is being processed. This time varies between applications. Processing time
  • 33.
    ● Streams are(roughly) ordered by time Out-of-order event streams 10:10 10:14 10:10 10:14
  • 34.
    Coping with outof order events This event will be read next
  • 35.
    Coping with outof order events These events follow
  • 36.
    Coping with outof order events Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results?
  • 37.
    Watermarks measure progressof event time Watermark ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order
  • 38.
    Watermarks measure progressof event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate 1:50 - 5 = 1:45
  • 39.
    Watermarks measure progressof event time ● This watermark has been generated by assuming that the stream is at most 5 minutes out-of-order ● The watermark is the max timestamp seen so far, minus this out-of-orderness estimate ● A watermark is an assertion about the completeness of the stream Now this stream is complete up to 1:45
  • 40.
    Watermarks measure progressof event time Imagine a window counting events for the hour ending at 2:00. How long should this window wait before producing its results? It should wait for a watermark with a timestamp of at least 2:00.
  • 41.
    What are watermarks for? Theymake things happen when the time is right.
  • 42.
    The idle stream problem ●Streams that are idle do not advance the watermark ● This prevents windows from producing results
  • 43.
    The idle stream problem ●Streams that are idle do not advance the watermark ● This prevents windows from producing results Solutions ● Balance the partitions so none are empty or idle, or ● Send keep-alive events, or ● Configure the watermarking to use idleness detection
  • 44.
    Watermarks ● Not neededfor applications that only use wall-clock (processing) time ● Not needed for batch processing ● Are needed for triggering actions based on event-time, e.g., closing a window ● Are generated based on an assumption of how out of order the data might be ● Provide control over the tradeoff between completeness and latency ● Flink SQL drops late events; the DataStream API offers more control ● Allow for consistent, reproducible results ● Potentially idle sources require special attention
  • 45.
  • 46.
    A checkpoint isan automatic snapshot created by Flink, primarily for the purpose of failure recovery
  • 47.
    A checkpoint isan automatic snapshot created by Flink, primarily for the purpose of failure recovery A savepoint is a manual snapshot created for some operational purpose (e.g., a stateful upgrade)
  • 48.
  • 49.
    Snapshots Source Filter Countby color Sink events results COUNT FILTER GROUP BY color
  • 50.
    Snapshots Source Filter Countby color Sink Offsets for some partitions Offsets for other partitions events results COUNT FILTER GROUP BY color
  • 51.
    Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Offsets for other partitions ______________________ events results COUNT FILTER GROUP BY color
  • 52.
    Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Counters for some colors Offsets for other partitions ______________________ Counters for other colors events results COUNT FILTER GROUP BY color
  • 53.
    Snapshots Source Filter Countby color Sink Offsets for some partitions ______________________ Counters for some colors Transaction ID Offsets for other partitions ______________________ Counters for other colors __________________ events results COUNT FILTER GROUP BY color
  • 54.
    Taking a snapshot does NOT stopthe world Checkpoints and savepoints are created asynchronously, while the job continues to process events and produce results
  • 55.
    Because these are self-consistent, global snapshots ● Flinkprovides (effectively) exactly-once guarantees ● Recovery involves restarting the entire job from the most recent checkpoint Recovery
  • 56.
  • 57.
    Streaming Unfamiliar to many developers,but ultimately straightforward Watermarks encapsulate something complex in one place – the sources ● how out-of-order? ● can it be idle? Transparent to application developers State snapshots for recovery Delightfully simple ● local ● key/value ● single-threaded State Event time and watermarks
  • 58.
    Where is the Flinkcommunity? To subscribe to the mailing lists, or get an invite to Slack, see https://flink.apache.org/community/
  • 61.
    Your Apache Flink® journeybegins here developer.confluent.io