Portable Streaming Pipelines with Apache Beam Frances Perry PMC for Apache Beam, Tech Lead at Google Kafka Summit, May 2017
Apache Beam: Open Source data processing APIs ● Expresses data-parallel batch and streaming algorithms using one unified API ● Cleanly separates data processing logic from runtime requirements ● Supports execution on multiple distributed processing runtime environments
The evolution of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Apache Beam
Agenda 1. Beam Model: Model Basics 2. Extensible IO Connectors 3. Portability: Write Once, Run Anywhere 4. Demo 5. Getting Started
Model Basics A unified model for batch and streaming
Processing time vs. event time
The Beam Model: asking the right questions What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: How do refinements relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
Customizing What / Where / When / How: 1 Classic Batch, 2 Windowed Batch, 3 Streaming, 4 Streaming + Accumulation
Extensible IO Connectors Like Kafka!
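The Kafka connector in the demo is Beam's KafkaIO transform. A minimal read sketch is below; the broker address and topic name are placeholders, and the runner is whatever the pipeline options select.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read game events from Kafka; each record arrives as a KV<String, String>.
    PCollection<KV<String, String>> events = p
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")       // placeholder broker address
            .withTopic("game-events")                  // placeholder topic name
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());                       // drop Kafka metadata, keep KVs

    p.run();
  }
}
```

Because KafkaIO is an unbounded source, the same windowing and triggering shown earlier applies directly to the resulting PCollection.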
The Beam vision for portability Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of runners: users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for developers: clean APIs allow developers to contribute modules independently
Beam Vision: as of March 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
  • Apache Apex
  • Apache Spark
  • Apache Flink
  • Google Cloud Dataflow
  • [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
  • Beam’s Python SDK currently runs on Google Cloud Dataflow
Example Beam Runners Apache Spark ● Open-source cluster- computing framework ● Large ecosystem of APIs and tools ● Runs on premise or in the cloud Apache Flink ● Open-source distributed data processing engine ● High-throughput and low-latency stream processing ● Runs on premise or in the cloud Google Cloud Dataflow ● Fully-managed service for batch and stream data processing ● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
How do you build an abstraction layer over Apache Spark, Cloud Dataflow, Apache Flink, and more?
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities http://beam.incubator.apache.org/documentation/runners/capability-matrix/
Parallel and portable pipelines in practice Demo!
Getting Started with Apache Beam Beaming into the Future
Getting Started with Apache Beam Quickstarts ● Java SDK ● Python SDK Example walkthroughs ● Word Count ● Mobile Gaming Extensive documentation
Learn more! Apache Beam https://beam.apache.org Join the Beam mailing lists user-subscribe@beam.apache.org dev-subscribe@beam.apache.org Follow @ApacheBeam on Twitter
Demo screenshots because if I make them, I won’t need to use them

Editor's Notes

  • #2 Good afternoon! My name is Frances Perry. I’m an engineer at Google and on the project management committee for Apache Beam. Today I’m going to give an introduction to Apache Beam… can you see me?
  • #3 -- which is a new open source project for expressing both batch and streaming data processing use cases. When you use Beam, you’re focusing on your logic and your data, without letting runtime details leak into your code. That separation means a Beam pipeline can run on many existing runtimes that you know and love, including Apache Spark, Apache Flink, and Google Cloud Dataflow. To put Beam in context within the broader Big Data ecosystem, let’s talk briefly about its evolution.
  • #4 Google published the original paper on MapReduce in 2004, which fundamentally changed the way we do distributed processing. <animate> Inside Google, we kept innovating, but initially just kept publishing papers. In 2014, we released Google Cloud Dataflow, which included both a new programming model and a fully managed service. <animate> Externally, the open source community created Hadoop, and an entire ecosystem flourished around it. Beam brings these two streams of work together. It’s based on the Dataflow programming model, but generalized and integrated with the broader ecosystem.
  • #5 Today I’m going to go into more detail on two key pieces of Apache Beam: the programming model, which intuitively expresses data-parallel operations, including both batch and streaming use cases; and the portability infrastructure, which lets you execute the same Beam pipeline across multiple runtimes. Next it’s time to get concrete -- I’ll show you these concepts in practice with a demo of the same pipeline, reading from Kafka and running on Apache Spark, Apache Flink, and Cloud Dataflow. And finally we’ll end with some pointers for getting started.
  • #6 We’ll start with a brief overview of the Beam model. If you’ve been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  • #7 Let’s take a look at some sample data -- the points scored for a specific team. On the x-axis we’ve got event time, and on the y-axis processing time. <animate> If everything were perfect, elements would arrive in our system immediately, and we’d see things along this dashed line. But distributed systems often don’t cooperate. <animate> Sometimes it’s not so bad. Here, this event from just before 12:07 perhaps just encountered a small network delay, and arrives almost immediately after 12:07. <animate> But this one over here was more like 7 minutes delayed. Perhaps our user was playing in an elevator or on a subway -- so the score is delayed by a temporary lack of network connectivity. And this graph can’t even contain what we’d see if our game supports an offline mode. If a user is playing on a transatlantic flight in airplane mode, it might be hours until that flight lands and we get those scores for processing. These types of infinite, out-of-order data sources can be really tricky to reason about… unless you know what questions to ask.
  • #8 The Beam model is based on four key questions: What results are calculated? Are you computing sums, joins, histograms, machine learning models? Where in event time are results calculated? How does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? How do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another? Let’s take a quick look at how we can use these questions to build a pipeline.
  • #9 Here’s a snippet from a pipeline that processes scoring results from that mobile gaming application. In Yellow, you can see the computation that we’re performing -- the what -- in this case taking team-score pairs and summing them per team. So now let’s see what happens to our sample data if we execute this in traditional batch style.
  • #10 In this looping animation, the grey line represents processing time. As the pipeline executes and processes elements, they’re accumulated into the intermediate state, just under the processing time line. When processing completes, the system emits the result in yellow. This is pretty standard batch processing. But as we dive into the remaining three questions, that’s going to change.
  • #11 Let’s start by playing with event time. By specifying a windowing function, we can calculate independent results for different slices of event time. For example every minute, every hour, every day... In this case, our same integer summation will output one sum every two minutes.
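The bucketing that Window.into(FixedWindows.of(...)) performs can be sketched without any Beam dependency. This is a hypothetical helper for illustration only: it maps an event-time timestamp to the start of its enclosing 2-minute fixed window, which is essentially what the runner does when it slices event time.

```java
// Sketch: event-time bucketing into 2-minute fixed windows, mirroring the
// slicing that Window.into(FixedWindows.of(Duration.standardMinutes(2)))
// asks the runner to perform. Names here are illustrative, not Beam API.
public class FixedWindowSketch {
  static final long WINDOW_MILLIS = 2 * 60 * 1000; // 2 minutes

  // Start of the fixed window containing the given event timestamp.
  static long windowStart(long eventTimeMillis) {
    return eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
  }

  public static void main(String[] args) {
    // An event at 250,000 ms falls in the window starting at 240,000 ms.
    System.out.println(windowStart(250_000L)); // prints 240000
  }
}
```

Every element keyed to the same window start is aggregated independently, which is why each two-minute slice produces its own sum.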
  • #12 Now if we look at how things execute, you can see that we are calculating an independent answer for every two-minute period of event time. But we’re still waiting until the entire computation completes to emit any results. That might work fine for bounded data sets, when we’ll eventually finish processing. But it’s not going to work if we’re trying to process an infinite amount of data!
  • #13 In that case we want to reduce the latency of individual results. We do that by asking for results to be triggered based on the system’s best estimate of when it has all the input data. We call this estimate the watermark.
  • #14 The watermark is drawn in green. And now the result for each window is emitted as soon as we roughly think we’re done. But again, the watermark is often just a heuristic. It’s the system’s best guess about data completeness. Right now, the watermark is too fast -- in some cases we’re moving on without all the data. So that user who scored 9 points in the elevator is just plain out of luck. But we don’t want to be too slow either -- it’s no good if we wait to emit anything until all the flights everywhere have landed just in case someone in 16B is playing our game.
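A real runner's watermark is far more sophisticated, but the intuition can be sketched as a toy heuristic: take the newest event time observed on each input partition, and advance the watermark to the minimum of those, minus a fixed slack for expected disorder. Everything here is illustrative, not Beam API.

```java
import java.util.Map;

// Sketch: a toy watermark heuristic. The watermark is the system's estimate
// of the event time up to which it has (probably) seen all input.
public class WatermarkSketch {
  // latestEventTimePerPartition: newest event timestamp seen on each partition.
  // slackMillis: allowance for how out-of-order we expect events to be.
  static long watermark(Map<String, Long> latestEventTimePerPartition, long slackMillis) {
    long min = Long.MAX_VALUE;
    for (long t : latestEventTimePerPartition.values()) {
      min = Math.min(min, t); // the slowest partition holds the watermark back
    }
    return min - slackMillis;
  }
}
```

Because this is only an estimate, it can be too fast (late data arrives behind it) or too slow (results are needlessly delayed), which is exactly the trade-off the note above describes.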
  • #15 So let’s use a more sophisticated trigger to request both speculative, early firings as data is still trickling in -- and also update results if late elements arrive. Once we do this, though, we might get multiple results for the same window of event time. So we have to answer the fourth question about how refined results relate. Here we choose to just continually accumulate the score.
  • #16 Now, there are multiple results for each window. Some windows, like the second, produce early, incomplete results as data arrives. There’s one on time result when we think we’ve pretty much got all the data. And there are late results if additional data comes in behind the watermark, like in the first window. And because we chose to accumulate, each result includes all the elements in the window, even if they have already been part of an earlier result.
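The difference between accumulating and discarding panes can be sketched with plain Java. These hypothetical helpers take the elements that arrived between successive trigger firings and show what each firing would report: the running total under accumulatingFiredPanes() semantics, versus only the delta under discarding semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulating vs. discarding panes for an integer sum.
// Each inner list holds the elements that arrived before one trigger firing.
public class PaneSketch {
  // Accumulating: every firing includes all elements seen so far in the window.
  static List<Integer> fireAccumulating(List<List<Integer>> firings) {
    List<Integer> out = new ArrayList<>();
    int total = 0;
    for (List<Integer> pane : firings) {
      for (int v : pane) total += v;   // prior elements stay in the result
      out.add(total);
    }
    return out;
  }

  // Discarding: every firing includes only elements since the last firing.
  static List<Integer> fireDiscarding(List<List<Integer>> firings) {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> pane : firings) {
      int delta = 0;
      for (int v : pane) delta += v;   // state is cleared after each firing
      out.add(delta);
    }
    return out;
  }
}
```

For firings [[3, 4], [2], [9]], accumulating emits 7, 9, 18 while discarding emits 7, 2, 9 -- which is why a downstream consumer must know which mode produced the stream of results.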
  • #17 So we took an algorithm -- in this case it happened to be a simple integer summation. I could have used something more complicated, but the animations would have gotten out of control. And just by tweaking a line here or there, we went through a number of use cases -- from the simple, traditional batch style through to advanced streaming situations. Just like the MapReduce model fundamentally changed the way we do distributed processing by providing the right set of abstractions, we hope that the Beam model will change the way we unify batch and streaming processing in the future.
  • #19 So that was the conceptual introduction to the types of use cases the Beam model can cover. Next, let’s talk about how the model enable portability.
  • #20 The heart of Beam is the model. We have multiple language-specific SDKs for constructing a Beam pipeline. Developers often have strong opinions about their language of choice, so we want to meet them where they are. Next, we have multiple runners for executing Beam pipelines on existing distributed processing engines. This lets the user choose the right environment for their use case. It might be on premise, or in the cloud. It might be open source. It might be fully managed. And these needs may change over time. Now, each of these runners needs to execute user processing, which means they need the ability to execute code in different languages. To do that, while keeping Beam components modular, we are building APIs that cleanly specify how a runner calls language-specific processing.
  • #21 Now that was where the project is going. In reality, this is where we are. The Java SDK runs across multiple runtimes. However the Python SDK currently runs only on Cloud Dataflow in batch, as we’re still building out the streaming and cross-language infrastructure.
  • #22 Let’s go into a bit more detail about some of these runners. I’m going to focus today on Spark, Flink, and Dataflow. These were the original three runners in Beam and are also the three I’ll be demoing today. Many of you are probably familiar with Apache Spark. It’s a very popular choice right now in the Big Data world. It excels at in-memory and interactive computations. Apache Flink is more of a newcomer to the broader big data scene. It’s got really clean semantics for stream processing. And Cloud Dataflow is GCP’s fully managed service for data processing pipelines that evolved from all those years of internal work.
  • #23 And though each of these runners does parallel data processing, they have some significant differences in how they go about that. And that makes it tricky to build an abstraction layer around them.
  • #24 We can’t just take the intersection of the functionality of all the engines -- that’s too limited.
  • #25 And on the other hand, taking the union would be a kitchen sink of chaos…
  • #26 Really, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines for a while and enabled interesting and common use cases, and was only recently added to Beam. And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam model.
  • #27 This also means there may be, at times, some divergence between the Beam model and the support a given runner has. That's why Beam tracks which portions of the model each runner currently supports -- and this gets updated as new functionality is built out.
  • #28 So with those concepts behind us, let’s get hands on.
  • #29 Finally, let’s look at how you can get started using Apache Beam.
  • #31 And of course, please come learn more about Beam in general. The Beam website has all sorts of good information, including details on all the different runners.