Slide 1

Slide 1 text

Apache Beam and Google Cloud Dataflow. Mete Atamel, Developer Advocate at Google (@meteatamel)

Slide 2

Slide 2 text

Apache Beam: A single unified model for batch and stream data processing

Slide 3

Slide 3 text

MapReduce: Batch Processing. (Figure: (Prepare) → Map → (Shuffle) → Reduce → (Produce).)

Slide 4

Slide 4 text

(Figure: Google data-processing technologies: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc.)

Slide 5

Slide 5 text

Data Processing @ Google, a timeline from 2002 to 2016: GFS, MapReduce, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow

Slide 6

Slide 6 text

FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data processing abstractions.
  ○ Focus on what you want to do to your data, not what the underlying system supports.
● A graph of transformations is automatically transformed into an optimized series of MapReduces.

Slide 7

Slide 7 text

Example: Computing mean temperature

// Collection of raw events
PCollection<String> raw = ...;

// Element-wise extract location/temperature pairs
PCollection<KV<String, Double>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Double>> output =
    input.apply(Mean.perKey());

// Write output
output.apply(BigtableIO.Write.to(...));
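The ParseFn referenced above is not shown on the slide. A minimal sketch of what it could look like in the Beam Java SDK, assuming each raw event is a comma-separated "location,temperature" line (the field layout is an assumption, not from the slides; imports from org.apache.beam.sdk.* omitted for brevity, as on the slides):

// Hypothetical parser: splits "location,temperature" into a KV pair.
class ParseFn extends DoFn<String, KV<String, Double>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] parts = c.element().split(",");              // e.g. "SEA,12.5"
    c.output(KV.of(parts[0], Double.parseDouble(parts[1])));
  }
}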

Slide 8

Slide 8 text


Slide 9

Slide 9 text

...big data...

Slide 10

Slide 10 text

...really, really big... (Figure: data sharded into daily batches: Tuesday, Wednesday, Thursday.)

Slide 11

Slide 11 text

Batch failure mode #1: Latency

Slide 12

Slide 12 text

Batch failure mode #2: Sessions. (Figure: daily MapReduce batches on Tuesday and Wednesday split the user sessions of Jose, Lisa, Ingo, Asha, Cheryl, and Ari across batch boundaries.)

Slide 13

Slide 13 text

Continuous & Unbounded. (Figure: an unbounded stream of events along a timeline from 1:00 to 14:00.)

Slide 14

Slide 14 text

(Figure: historical events flow into periodic batch processing to build an exact historical model, while continuous updates flow into a stream processing system to build an approximate real-time model.)

Slide 15

Slide 15 text

Data Processing @ Google, a timeline from 2002 to 2016: GFS, MapReduce, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow

Slide 16

Slide 16 text

MillWheel: Streaming Computations
● Framework for building low-latency data-processing applications
● User provides a DAG of computations to be performed
● System manages state and persistent flow of elements

Slide 17

Slide 17 text

Streaming Patterns: Element-wise transformations. (Figure: elements transformed one by one along a processing-time axis from 8:00 to 14:00.)

Slide 18

Slide 18 text

Streaming Patterns: Aggregating Time-Based Windows. (Figure: elements grouped into processing-time windows along the 8:00 to 14:00 axis.)

Slide 19

Slide 19 text

Streaming Patterns: Event-Time Based Windows. (Figure: input elements plotted by processing time versus event time, 10:00 to 15:00, with output grouped into event-time windows.)

Slide 20

Slide 20 text

Streaming Patterns: Session Windows. (Figure: input elements plotted by processing time versus event time, 10:00 to 15:00, with output grouped into per-key sessions in event time.)

Slide 21

Slide 21 text

Formalizing Event-Time Skew
Watermarks describe event-time progress: "No timestamp earlier than the watermark will be seen."
Often heuristic-based.
Too slow? Results are delayed. Too fast? Some data is late.

Slide 22

Slide 22 text

Streaming or Batch? Tradeoffs: completeness (1 + 1 = 2), latency, and cost ($$$).

Slide 23

Slide 23 text

Dataflow Model: one model unifying batch and streaming

Slide 24

Slide 24 text

What are you computing? Where in event time are results calculated? When in processing time are results materialized? How do refinements relate?

Slide 25

Slide 25 text

What are you computing? Element-wise transformations (ParDo), aggregations (GroupByKey, Combine), and composites of the two (e.g. ParDo + Count + ParDo). A short sketch of the first two follows below.
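A minimal sketch of the element-wise and aggregating cases in the Beam Java SDK; the input collection `lines` is an assumed PCollection<String>, and imports from org.apache.beam.sdk.* are omitted for brevity:

// Element-wise: ParDo emits output for each input element independently.
PCollection<String> lowercased = lines.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toLowerCase());   // one output per input line
  }
}));

// Aggregating: Count.perElement() yields one (element, count) pair per distinct value.
PCollection<KV<String, Long>> counts = lowercased.apply(Count.perElement());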

Slide 26

Slide 26 text

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
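For context, a minimal sketch of how this snippet might sit inside a complete pipeline, using the TextIO read API from recent Beam Java releases; the input path, ParseFn, and options handling are assumptions, not from the slides:

// Build and run the pipeline.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);

PCollection<String> raw = p.apply(TextIO.read().from("gs://some-bucket/game-logs*"));  // hypothetical path
PCollection<KV<String, Integer>> scores = raw
    .apply(ParDo.of(new ParseFn()))          // assumed "team,score" parser
    .apply(Sum.integersPerKey());

p.run().waitUntilFinish();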

Slide 27

Slide 27 text

What: Computing Integer Sums

Slide 28

Slide 28 text

What: Computing Integer Sums

Slide 29

Slide 29 text

Where in event time? Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. (Figure: fixed, sliding, and session windows laid out per key over time.)
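A rough sketch of what each pictured strategy looks like in the Beam Java SDK, applied to the `input` collection from the earlier slides; the window sizes are arbitrary illustrative values:

// Fixed (tumbling) windows: non-overlapping, e.g. one window per hour.
PCollection<KV<String, Integer>> fixed = input.apply(
    Window.into(FixedWindows.of(Duration.standardHours(1))));

// Sliding windows: overlapping, e.g. hour-long windows starting every 10 minutes.
PCollection<KV<String, Integer>> sliding = input.apply(
    Window.into(SlidingWindows.of(Duration.standardHours(1))
        .every(Duration.standardMinutes(10))));

// Session windows: per-key bursts of activity separated by at least a 30-minute gap.
PCollection<KV<String, Integer>> sessions = input.apply(
    Window.into(Sessions.withGapDuration(Duration.standardMinutes(30))));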

Slide 30

Slide 30 text

Where: Fixed 2-minute Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

Slide 31

Slide 31 text

Where: Fixed 2-minute Windows

Slide 32

Slide 32 text

When in processing time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.

Slide 33

Slide 33 text

When: Triggering at the Watermark

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
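AtWatermark() is the Dataflow/Beam model's shorthand. In the released Beam Java SDK the same idea would be expressed with AfterWatermark, and setting a trigger also requires an allowed-lateness bound and an accumulation mode. A hedged sketch (the zero allowed lateness is an arbitrary choice, not from the slides):

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())   // fire when the watermark passes the end of the window
        .withAllowedLateness(Duration.ZERO)             // drop data that arrives after that
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());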

Slide 34

Slide 34 text

When: Triggering at the Watermark

Slide 35

Slide 35 text

When: Early and Late Firings

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
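The same early/late pattern written against the released Beam Java SDK trigger classes; the one-minute early-firing delay and 30-minute allowed lateness are illustrative values, not from the slides:

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))       // speculative result every minute
            .withLateFirings(AfterPane.elementCountAtLeast(1)))  // re-fire on each late element
        .withAllowedLateness(Duration.standardMinutes(30))
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());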

Slide 36

Slide 36 text

When: Early and Late Firings

Slide 37

Slide 37 text

How do refinements relate?
• How should multiple outputs per window accumulate?
• Should we emit the running sum, or only the values that have come in since the last result? (Accumulating & Retracting is not yet implemented in Apache Beam.)

Slide 38

Slide 38 text

How: Add Newest, Remove Previous

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
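For contrast, a sketch of the discarding alternative, keeping the slide's shorthand trigger notation: with accumulatingFiredPanes() each firing re-emits the full running sum for the window, while discardingFiredPanes() emits only what arrived since the previous firing. For example, if a window receives 3, then 4, then 3, accumulating panes emit 3, 7, 10 and discarding panes emit 3, 4, 3.

PCollection<KV<String, Integer>> deltas = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .discardingFiredPanes())   // each pane carries only the new delta
    .apply(Sum.integersPerKey());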

Slide 39

Slide 39 text

Customizing What / Where / When / How: 1. Classic Batch, 2. Batch with Fixed Windows, 3. Streaming, 4. Streaming with Speculative + Late Data, 5. Streaming with Accumulations

Slide 40

Slide 40 text

Dataflow to Apache Beam: the evolution of Dataflow into Apache Beam

Slide 41

Slide 41 text

The Dataflow Model & Cloud Dataflow
• Dataflow Model & SDKs: a unified model for batch and stream processing
• Google Cloud Dataflow: a no-ops, fully managed service

Slide 42

Slide 42 text

The Dataflow Model & Cloud Dataflow, now Beam
• Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes
• Google Cloud Dataflow: a great place to run Beam

Slide 43

Slide 43 text

What is Part of Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines, starting with Java
3. Runners for existing distributed processing backends (a runner-selection sketch follows below):
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing
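A minimal sketch of how a runner is typically selected at pipeline-construction time in the Beam Java SDK; the available flag values depend on which runner modules are on the classpath:

// Runner choice usually comes from the command line, e.g.
//   --runner=DirectRunner | FlinkRunner | SparkRunner | DataflowRunner
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);

// ... build the same What/Where/When/How pipeline regardless of runner ...

p.run();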

Slide 44

Slide 44 text

Apache Beam Vision: Mix/Match SDKs & Runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: users write their pipelines in a language that's familiar and integrated with their other tooling
● Choice of Runners: users choose the right runtime for their current needs: on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: clean APIs allow developers to contribute modules independently
(Figure: Language A/B/C SDKs all target the Beam Model, which runs on Runner 1, 2, or 3.)

Slide 45

Slide 45 text

Apache Beam Vision: as of March 2017
● Beam's Java SDK runs on multiple runtime environments, including:
  ○ Apache Apex
  ○ Apache Spark
  ○ Apache Flink
  ○ Google Cloud Dataflow
  ○ [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
  ○ Beam's Python SDK currently runs on Google Cloud Dataflow
(Figure: Java and Python SDKs feed Beam Model: Pipeline Construction, which hands off to Beam Model: Fn Runners on Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Cloud Dataflow.)

Slide 46

Slide 46 text

How do you build an abstraction layer? (Figure: Apache Spark, Cloud Dataflow, Apache Flink, and unknown future runners.)

Slide 47

Slide 47 text

Beam: the intersection of runner functionality?

Slide 48

Slide 48 text

Beam: the union of runner functionality?

Slide 49

Slide 49 text

Beam: the future!

Slide 50

Slide 50 text

Categorizing Runner capabilities: http://beam.incubator.apache.org/documentation/runners/capability-matrix/

Slide 51

Slide 51 text

Data Processing with Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam

Slide 52

Slide 52 text

Thank You
cloud.google.com/dataflow | beam.apache.org | @ApacheBeam
Mete Atamel, @meteatamel