
[Mete Atamel] Apache Beam and Google Cloud Dataflow

Presentation from GDG DevFest Ukraine 2017, the biggest community-driven Google tech conference in Central and Eastern Europe.

Learn more at: https://devfest.gdg.org.ua

Google Developers Group Lviv

October 13, 2017


Transcript

  1. [Slide graphic] Google data processing technologies: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc.
  2. [Slide graphic] Data Processing @ Google: a timeline from 2002 to 2016 covering MapReduce, GFS, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, and Dataflow.
  3. FlumeJava: Easy and Efficient MapReduce Pipelines
     • Higher-level API with simple data processing abstractions.
       ◦ Focus on what you want to do to your data, not what the underlying system supports.
     • A graph of transformations is automatically transformed into an optimized series of MapReduces.
  4. Example: Computing mean temperature
     // Collection of raw events
     PCollection<SensorEvent> raw = ...;
     // Element-wise: extract location/temperature pairs
     PCollection<KV<String, Double>> input =
         raw.apply(ParDo.of(new ParseFn()));
     // Composite transformation containing an aggregation
     PCollection<KV<String, Double>> output =
         input.apply(Mean.<Double>perKey());
     // Write output
     output.apply(BigtableIO.Write.to(...));
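     The slide leaves ParseFn undefined. A minimal sketch of what such a DoFn could look like in the Beam Java SDK; the SensorEvent accessors (getLocation, getTemperature) are assumptions for illustration, not part of the deck:

     // Hypothetical parser: emits one (location, temperature) pair per event.
     class ParseFn extends DoFn<SensorEvent, KV<String, Double>> {
       @ProcessElement
       public void processElement(ProcessContext c) {
         SensorEvent e = c.element();
         // field accessors are assumed; adapt to the real event schema
         c.output(KV.of(e.getLocation(), e.getTemperature()));
       }
     }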
  5. [Slide graphic] Batch failure mode #2: Sessions. Per-user activity (Jose, Lisa, Ingo, Asha, Cheryl, Ari) spans the Tuesday/Wednesday boundary, so per-day MapReduce batches cut sessions in two.
  6. [Slide graphic] Continuous & Unbounded: an event stream across a time axis with hourly ticks from 1:00 to 14:00.
  7. [Slide graphic] Data Processing @ Google: the 2002 to 2016 timeline again (MapReduce, GFS, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow).
  8. MillWheel: Streaming Computations
     • Framework for building low-latency data-processing applications
     • User provides a DAG of computations to be performed
     • System manages state and persistent flow of elements
  9. [Slide graphic] Streaming Patterns: Event-Time Based Windows. Input and output plotted against event time and processing time (10:00 to 15:00).
  10. [Slide graphic] Streaming Patterns: Session Windows. Input and output plotted against event time and processing time (10:00 to 15:00).
  11. Formalizing Event-Time Skew
      Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based.
      Too slow? Results are delayed. Too fast? Some data is late.
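      Beam lets a pipeline bound this trade-off explicitly. A minimal sketch, assuming an `input` collection of key/value scores as in the later slides: withAllowedLateness() says how long past the watermark a window stays open for stragglers.

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.<KV<String, Integer>>into(
                  FixedWindows.of(Duration.standardMinutes(2)))
              // keep windows open one extra hour of event time for late data;
              // anything arriving later than this is dropped (duration is an
              // arbitrary example value)
              .withAllowedLateness(Duration.standardHours(1)))
          .apply(Sum.integersPerKey());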
  12. What are you computing? Where in event time are results calculated? When in processing time are results materialized? How do refinements relate?
  13. What are you computing? (What / Where / When / How)
      • Element-wise: ParDo
      • Aggregating: GroupByKey, Combine
      • Composite: ParDo + Count + ParDo
      (A sketch of these transform families follows below.)
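      A short sketch of the two transform families named above, in the Beam Java SDK; the `lines` collection and the comma-separated input format are assumptions for illustration:

      // Element-wise: a ParDo applies a DoFn to each element independently.
      PCollection<KV<String, Integer>> pairs = lines.apply(
          ParDo.of(new DoFn<String, KV<String, Integer>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              String[] parts = c.element().split(",");  // assumed "key,value" lines
              c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
            }
          }));

      // Aggregating: Count.perKey() is itself a composite transform built on
      // GroupByKey/Combine, matching the "ParDo + Count + ParDo" pattern.
      PCollection<KV<String, Long>> counts = pairs.apply(Count.perKey());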
  14. What: Computing Integer Sums (What / Where / When / How)
      // Collection of raw log lines
      PCollection<String> raw = IO.read(...);
      // Element-wise transformation into team/score pairs
      PCollection<KV<String, Integer>> input =
          raw.apply(ParDo.of(new ParseFn()));
      // Composite transformation containing an aggregation
      PCollection<KV<String, Integer>> scores =
          input.apply(Sum.integersPerKey());
  15. Where in event time? (What / Where / When / How)
      Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data.
      [Slide graphic] Window shapes: Fixed, Sliding, and per-key Sessions, drawn over keys 1 to 3 against time.
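      For reference, the three window shapes on the slide map onto the Beam Java SDK as follows; a minimal sketch, with all durations chosen arbitrarily and the `input` collection assumed:

      // Fixed: non-overlapping 2-minute windows
      input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))));

      // Sliding: 2-minute windows starting every 30 seconds (overlapping)
      input.apply(Window.into(SlidingWindows.of(Duration.standardMinutes(2))
          .every(Duration.standardSeconds(30))));

      // Sessions: per-key windows that close after a 10-minute gap in activity
      input.apply(Window.into(
          Sessions.withGapDuration(Duration.standardMinutes(10))));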
  16. Where: Fixed 2-minute Windows (What / Where / When / How)
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
          .apply(Sum.integersPerKey());
  17. When in processing time? (What / Where / When / How)
      • Triggers control when results are emitted.
      • Triggers are often relative to the watermark.
  18. When: Triggering at the Watermark (What / Where / When / How)
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());
  19. When: Early and Late Firings (What / Where / When / How)
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());
  20. How do refinements relate? (What / Where / When / How)
      • How should multiple outputs per window accumulate?
      • Should we emit the running sum, or only the values that have come in since the last result? (Accumulating & Retracting not yet implemented in Apache Beam.)
  21. How: Add Newest, Remove Previous (What / Where / When / How)
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingFiredPanes())
          .apply(Sum.integersPerKey());
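      The trigger names on these slides are the talk's shorthand. In the released Beam Java SDK, AtWatermark() is AfterWatermark.pastEndOfWindow(), AtPeriod(...) roughly corresponds to AfterProcessingTime.pastFirstElementInPane().plusDelayOf(...), and AtCount(n) is AfterPane.elementCountAtLeast(n). For the slide-20 alternative, emitting only the values seen since the last firing, a hedged sketch using discarding mode (Beam requires allowed lateness once a trigger is set; the one-hour value is an assumption):

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.<KV<String, Integer>>into(
                  FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AfterWatermark.pastEndOfWindow()
                  .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                      .plusDelayOf(Duration.standardMinutes(1)))
                  .withLateFirings(AfterPane.elementCountAtLeast(1)))
              .withAllowedLateness(Duration.standardHours(1))
              // each firing emits only the values seen since the previous pane
              .discardingFiredPanes())
          .apply(Sum.integersPerKey());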
  22. Customizing What / Where / When / How:
      1. Classic Batch
      2. Batch with Fixed Windows
      3. Streaming
      4. Streaming with Speculative + Late Data
      5. Streaming with Accumulations
  23. The Dataflow Model & Cloud Dataflow
      • Dataflow Model & SDKs: a unified model for batch and stream processing
      • Google Cloud Dataflow: a no-ops, fully managed service
  24. The Dataflow Model & Cloud Dataflow, with Beam
      • Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes
      • Google Cloud Dataflow: a great place to run Beam
  25. What is Part of Apache Beam?
      1. The Beam Model: What / Where / When / How
      2. SDKs for writing Beam pipelines -- starting with Java
      3. Runners for existing distributed processing backends (see the runner-selection sketch below):
         • Apache Flink (thanks to data Artisans)
         • Apache Spark (thanks to Cloudera)
         • Google Cloud Dataflow (fully managed service)
         • Local (in-process) runner for testing
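      How a pipeline binds to one of these runners is purely configuration. A minimal sketch of the standard Beam Java entry point; the class name is an assumption, and the runner is picked on the command line, e.g. --runner=DirectRunner for local testing or --runner=DataflowRunner (plus project and staging options) for the managed service:

      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.options.PipelineOptions;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;

      public class StarterPipeline {
        public static void main(String[] args) {
          // parse --runner and any runner-specific flags from the command line
          PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
          Pipeline p = Pipeline.create(options);
          // ... build the What/Where/When/How pipeline from the earlier slides ...
          p.run();
        }
      }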
  26. Apache Beam Vision: Mix/Match SDKs & Runtimes
      • The Beam Model: the abstractions at the core of Apache Beam
      • Choice of SDK: users write their pipelines in a language that's familiar and integrated with their other tooling
      • Choice of Runners: users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
      • Scalability for Developers: clean APIs allow developers to contribute modules independently
      [Slide graphic] Language A/B/C SDKs feeding the Beam Model, which feeds Runners 1 to 3.
  27. Apache Beam Vision: as of March 2017
      • Beam's Java SDK runs on multiple runtime environments, including:
        ◦ Apache Apex
        ◦ Apache Spark
        ◦ Apache Flink
        ◦ Google Cloud Dataflow
        ◦ [in development] Apache Gearpump
      • Cross-language infrastructure is in progress.
        ◦ Beam's Python SDK currently runs on Google Cloud Dataflow
      [Slide graphic] Java and Python pipeline construction sitting on the Beam Model, with Fn Runners for Apache Apex, Apache Spark, Apache Flink, Apache Gearpump, and Cloud Dataflow.
  28. How do you build an abstraction layer? (Apache Spark, Cloud Dataflow, Apache Flink, ????????, ????????)
  29. [Slide graphic] Data Processing with Apache Beam, alongside MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, and Cloud Dataproc.
  30. Thank You
      cloud.google.com/dataflow | beam.apache.org | @ApacheBeam
      Mete Atamel | @meteatamel | [email protected] | meteatamel.wordpress.com