Portable Streaming Pipelines with Apache Beam Frances Perry PMC for Apache Beam, Tech Lead at Google Kafka Summit, May 2017
Apache Beam: Open Source data processing APIs ● Expresses data-parallel batch and streaming algorithms using one unified API ● Cleanly separates data processing logic from runtime requirements ● Supports execution on multiple distributed processing runtime environments
The evolution of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Apache Beam
Agenda 1. Beam Model: Model Basics 2. Extensible IO Connectors 3. Portability: Write Once, Run Anywhere 4. Demo 5. Getting Started
Model Basics A unified model for batch and streaming
Processing time vs. event time
The Beam Model: asking the right questions What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: How do refinements relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
Customizing What / Where / When / How: 1 Classic Batch, 2 Windowed Batch, 3 Streaming, 4 Streaming + Accumulation
Extensible IO Connectors Like Kafka!
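The Kafka connector in the demo is Beam's KafkaIO transform. A minimal read sketch is below; the broker address and topic name are placeholders, and the runner is whatever the pipeline options select.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read game events from Kafka; each record arrives as a KV<String, String>.
    PCollection<KV<String, String>> events = p
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")       // placeholder broker address
            .withTopic("game-events")                  // placeholder topic name
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());                       // drop Kafka metadata, keep KVs

    p.run();
  }
}
```

Because KafkaIO is an unbounded source, the same windowing and triggering shown earlier applies directly to the resulting PCollection.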
The Beam vision for portability Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of runners: users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for developers: clean APIs allow developers to contribute modules independently
Beam Vision: as of March 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
  • Apache Apex
  • Apache Spark
  • Apache Flink
  • Google Cloud Dataflow
  • [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
  • Beam’s Python SDK currently runs on Google Cloud Dataflow
Example Beam Runners Apache Spark ● Open-source cluster- computing framework ● Large ecosystem of APIs and tools ● Runs on premise or in the cloud Apache Flink ● Open-source distributed data processing engine ● High-throughput and low-latency stream processing ● Runs on premise or in the cloud Google Cloud Dataflow ● Fully-managed service for batch and stream data processing ● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
How do you build an abstraction layer over Apache Spark, Cloud Dataflow, Apache Flink, and more?
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities http://beam.incubator.apache.org/documentation/runners/capability-matrix/
Parallel and portable pipelines in practice Demo!
Getting Started with Apache Beam Beaming into the Future
Getting Started with Apache Beam Quickstarts ● Java SDK ● Python SDK Example walkthroughs ● Word Count ● Mobile Gaming Extensive documentation
Learn more! Apache Beam https://beam.apache.org Join the Beam mailing lists user-subscribe@beam.apache.org dev-subscribe@beam.apache.org Follow @ApacheBeam on Twitter
Demo screenshots because if I make them, I won’t need to use them

Editor's Notes

  • #2 Good afternoon! My name is Frances Perry. I’m an engineer at Google and on the project management committee for Apache Beam. Today I’m going to give an introduction to Apache Beam… can you see me?
  • #3 -- which is a new open source project for expressing both batch and streaming data processing use cases. When you use Beam, you’re focusing on your logic and your data, without letting runtime details leak into your code. That separation means a Beam pipeline can run on many existing runtimes that you know and love, including Apache Spark, Apache Flink, and Google Cloud Dataflow. To put Beam in context within the broader Big Data ecosystem, let’s talk briefly about its evolution.
  • #4 Google published the original paper on MapReduce in 2004, which fundamentally changed the way we do distributed processing. <animate> Inside Google, we kept innovating, but initially just kept publishing papers. In 2014, we released Google Cloud Dataflow, which included both a new programming model and a fully managed service. <animate> Externally, the open source community created Hadoop, and an entire ecosystem flourished around it. Beam brings these two streams of work together. It’s based on the Dataflow programming model, but generalized and integrated with the broader ecosystem.
  • #5 Today I’m going to go into more detail on two key pieces of Apache Beam: the programming model, which intuitively expresses data-parallel operations, including both batch and streaming use cases; and the portability infrastructure, which lets you execute the same Beam pipeline across multiple runtimes. Next it’s time to get concrete -- I’ll show you these concepts in practice with a demo of the same pipeline, reading from Kafka and running on Apache Spark, Apache Flink, and Cloud Dataflow. And finally we’ll end with some pointers for getting started.
  • #6 We’ll start with a brief overview of the Beam model. If you’ve been to other talks over the last few days, you may have heard my favorite example already, so I’ll just set the context briefly. We’re going to be using a running example of analyzing mobile gaming logs. We’ve just launched an addictive new mobile game, where we’ve got users across the globe forming teams and scoring points on their mobile devices.
  • #7 Let’s take a look at some sample data -- the points scored for a specific team. On the x-axis we’ve got event time, and on the y-axis processing time. <animate> If everything were perfect, elements would arrive in our system immediately, and we’d see things along this dashed line. But distributed systems often don’t cooperate. <animate> Sometimes it’s not so bad. Here, this event from just before 12:07 perhaps just encountered a small network delay, and arrives almost immediately after 12:07. <animate> But this one over here was more like 7 minutes delayed. Perhaps our user was playing in an elevator or on a subway -- so the score is delayed by a temporary lack of network connectivity. And this graph can’t even contain what we’d see if our game supports an offline mode. If a user is playing on a transatlantic flight in airplane mode, it might be hours until that flight lands and we get those scores for processing. These types of infinite, out-of-order data sources can be really tricky to reason about… unless you know what questions to ask.
  • #8 The Beam model is based on four key questions: What results are calculated? Are you computing sums, joins, histograms, machine learning models? Where in event time are results calculated? How does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? How do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another? Let’s take a quick look at how we can use these questions to build a pipeline.
  • #9 Here’s a snippet from a pipeline that processes scoring results from that mobile gaming application. In Yellow, you can see the computation that we’re performing -- the what -- in this case taking team-score pairs and summing them per team. So now let’s see what happens to our sample data if we execute this in traditional batch style.
  • #10 In this looping animation, the grey line represents processing time. As the pipeline executes and processes elements, they’re accumulated into the intermediate state, just under the processing time line. When processing completes, the system emits the result in yellow. This is pretty standard batch processing. But as we dive into the remaining three questions, that’s going to change.
  • #11 Let’s start by playing with event time. By specifying a windowing function, we can calculate independent results for different slices of event time. For example every minute, every hour, every day... In this case, our same integer summation will output one sum every two minutes.
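The bucketing that Window.into(FixedWindows.of(...)) performs can be sketched without any Beam dependency. This is a hypothetical helper for illustration only: it maps an event-time timestamp to the start of its enclosing 2-minute fixed window, which is essentially what the runner does when it slices event time.

```java
// Sketch: event-time bucketing into 2-minute fixed windows, mirroring the
// slicing that Window.into(FixedWindows.of(Duration.standardMinutes(2)))
// asks the runner to perform. Names here are illustrative, not Beam API.
public class FixedWindowSketch {
  static final long WINDOW_MILLIS = 2 * 60 * 1000; // 2 minutes

  // Start of the fixed window containing the given event timestamp.
  static long windowStart(long eventTimeMillis) {
    return eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
  }

  public static void main(String[] args) {
    // An event at 250,000 ms falls in the window starting at 240,000 ms.
    System.out.println(windowStart(250_000L)); // prints 240000
  }
}
```

Every element keyed to the same window start is aggregated independently, which is why each two-minute slice produces its own sum.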
  • #12 Now if we look at how things execute, you can see that we are calculating an independent answer for every two-minute period of event time. But we’re still waiting until the entire computation completes to emit any results. That might work fine for bounded data sets, when we’ll eventually finish processing. But it’s not going to work if we’re trying to process an infinite amount of data!
  • #13 In that case we want to reduce the latency of individual results. We do that by asking for results to be triggered based on the system’s best estimate of when it has all the input data. We call this estimate the watermark.
  • #14 The watermark is drawn in green. And now the result for each window is emitted as soon as we roughly think we’re done. But again, the watermark is often just a heuristic. It’s the system’s best guess about data completeness. Right now, the watermark is too fast -- in some cases we’re moving on without all the data. So that user who scored 9 points in the elevator is just plain out of luck. But we don’t want to be too slow either -- it’s no good if we wait to emit anything until all the flights everywhere have landed just in case someone in 16B is playing our game.
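A real runner's watermark is far more sophisticated, but the intuition can be sketched as a toy heuristic: take the newest event time observed on each input partition, and advance the watermark to the minimum of those, minus a fixed slack for expected disorder. Everything here is illustrative, not Beam API.

```java
import java.util.Map;

// Sketch: a toy watermark heuristic. The watermark is the system's estimate
// of the event time up to which it has (probably) seen all input.
public class WatermarkSketch {
  // latestEventTimePerPartition: newest event timestamp seen on each partition.
  // slackMillis: allowance for how out-of-order we expect events to be.
  static long watermark(Map<String, Long> latestEventTimePerPartition, long slackMillis) {
    long min = Long.MAX_VALUE;
    for (long t : latestEventTimePerPartition.values()) {
      min = Math.min(min, t); // the slowest partition holds the watermark back
    }
    return min - slackMillis;
  }
}
```

Because this is only an estimate, it can be too fast (late data arrives behind it) or too slow (results are needlessly delayed), which is exactly the trade-off the note above describes.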
  • #15 So let’s use a more sophisticated trigger to request both speculative, early firings as data is still trickling in -- and also update results if late elements arrive. Once we do this, though, we might get multiple results for the same window of event time. So we have to answer the fourth question about how refined results relate. Here we choose to just continually accumulate the score.
  • #16 Now, there are multiple results for each window. Some windows, like the second, produce early, incomplete results as data arrives. There’s one on time result when we think we’ve pretty much got all the data. And there are late results if additional data comes in behind the watermark, like in the first window. And because we chose to accumulate, each result includes all the elements in the window, even if they have already been part of an earlier result.
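The difference between accumulating and discarding panes can be sketched with plain Java. These hypothetical helpers take the elements that arrived between successive trigger firings and show what each firing would report: the running total under accumulatingFiredPanes() semantics, versus only the delta under discarding semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulating vs. discarding panes for an integer sum.
// Each inner list holds the elements that arrived before one trigger firing.
public class PaneSketch {
  // Accumulating: every firing includes all elements seen so far in the window.
  static List<Integer> fireAccumulating(List<List<Integer>> firings) {
    List<Integer> out = new ArrayList<>();
    int total = 0;
    for (List<Integer> pane : firings) {
      for (int v : pane) total += v;   // prior elements stay in the result
      out.add(total);
    }
    return out;
  }

  // Discarding: every firing includes only elements since the last firing.
  static List<Integer> fireDiscarding(List<List<Integer>> firings) {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> pane : firings) {
      int delta = 0;
      for (int v : pane) delta += v;   // state is cleared after each firing
      out.add(delta);
    }
    return out;
  }
}
```

For firings [[3, 4], [2], [9]], accumulating emits 7, 9, 18 while discarding emits 7, 2, 9 -- which is why a downstream consumer must know which mode produced the stream of results.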
  • #17 So we took an algorithm -- in this case it happened to be a simple integer summation. I could have used something more complicated, but the animations would have gotten out of control. And just by tweaking a line here or there, we went through a number of use cases -- from the simple, traditional batch style through to advanced streaming situations. Just like the MapReduce model fundamentally changed the way we do distributed processing by providing the right set of abstractions, we hope that the Beam model will change the way we unify batch and streaming processing in the future.
  • #19 So that was the conceptual introduction to the types of use cases the Beam model can cover. Next, let’s talk about how the model enable portability.
  • #20 The heart of Beam is the model. We have multiple language-specific SDKs for constructing a Beam pipeline. Developers often have strong opinions about their language of choice, so we want to meet them where they are. Next, we have multiple runners for executing Beam pipelines on existing distributed processing engines. This lets the user choose the right environment for their use case. It might be on premise, or in the cloud. It might be open source. It might be fully managed. And these needs may change over time. Now, each of these runners needs to execute user processing, which means they need the ability to execute code in different languages. To do that, while keeping Beam components modular, we are building APIs that cleanly specify how a runner calls language-specific processing.
  • #21 Now that was where the project is going. In reality, this is where we are. The Java SDK runs across multiple runtimes. However the Python SDK currently runs only on Cloud Dataflow in batch, as we’re still building out the streaming and cross-language infrastructure.
  • #22 Let’s go into a bit more detail about some of these runners. I’m going to focus today on Spark, Flink, and Dataflow. These were the original three runners in Beam and are also the three I’ll be demoing today. Many of you are probably familiar with Apache Spark. It’s a very popular choice right now in the Big Data world. It excels at in-memory and interactive computations. Apache Flink is more of a newcomer to the broader big data scene. It’s got really clean semantics for stream processing. And Cloud Dataflow is GCP’s fully managed service for data processing pipelines that evolved from all those years of internal work.
  • #23 And though each of these runners does parallel data processing, they have some significant differences in how they go about that. And that makes it tricky to build an abstraction layer around them.
  • #24 We can’t just take the intersection of the functionality of all the engines -- that’s too limited.
  • #25 And on the other hand, taking the union would be a kitchen sink of chaos…
  • #26 Really, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines. Keyed State is a great example of functionality that existed in various engines for a while and enabled interesting and common use cases, and was only recently added to Beam. And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam model.
  • #27 This also means there may be, at times, some divergence between the Beam model and the support a given runner has. That's why Beam tracks which portions of the model each runner currently supports -- and this gets updated as new functionality is built out.
  • #28 So with those concepts behind us, let’s get hands on.
  • #29 Finally, let’s look at how you can get started using Apache Beam.
  • #31 And of course, please come learn more about Beam in general. The Beam website has all sorts of good information, including details on all the different runners.