
Apache Beam and Dataflow

Mete Atamel

October 16, 2017

Transcript

  1. Data processing technologies at Google: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc.
  2. Data Processing @ Google. [Figure: a timeline from 2002 to 2016 spanning GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, and Dataflow.]
  3. FlumeJava: Easy and Efficient MapReduce Pipelines
     • Higher-level API with simple data processing abstractions.
       ◦ Focus on what you want to do to your data, not on what the underlying system supports.
     • A graph of transformations is automatically transformed into an optimized series of MapReduces.
  4. Example: Computing mean temperature

     // Collection of raw events
     PCollection<SensorEvent> raw = ...;

     // Element-wise extract location/temperature pairs
     PCollection<KV<String, Double>> input =
         raw.apply(ParDo.of(new ParseFn()));

     // Composite transformation containing an aggregation
     PCollection<KV<String, Double>> output =
         input.apply(Mean.<Double>perKey());

     // Write output
     output.apply(BigtableIO.Write.to(...));
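The deck leaves ParseFn undefined. A minimal sketch of what such a DoFn could look like, assuming hypothetical getLocation() and getTemperature() accessors on SensorEvent:

    // Hypothetical ParseFn: emits a (location, temperature) pair per event.
    class ParseFn extends DoFn<SensorEvent, KV<String, Double>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        SensorEvent event = c.element();
        // getLocation()/getTemperature() are assumed accessors, not shown in the deck.
        c.output(KV.of(event.getLocation(), event.getTemperature()));
      }
    }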
  5. Batch failure mode #2: Sessions. [Figure: per-user sessions for Jose, Lisa, Ingo, Asha, Cheryl, and Ari, split across the Tuesday and Wednesday MapReduce batches.]
  6. Continuous & Unbounded. [Figure: data arriving continuously around the clock, with no natural end.]
  7. Data Processing @ Google. [Figure: the 2002 to 2016 timeline again, from GFS and MapReduce through MillWheel and Dataflow.]
  8. MillWheel: Streaming Computations
     • Framework for building low-latency data-processing applications
     • User provides a DAG of computations to be performed
     • System manages state and persistent flow of elements
  9. Streaming Patterns: Event-Time Based Windows. [Figure: input and output mapped on event-time and processing-time axes, 10:00 to 15:00.]
  10. Streaming Patterns: Session Windows. [Figure: the same event-time vs. processing-time axes, with irregular session-shaped windows.]
  11. Formalizing Event-Time Skew
      Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." They are often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late.
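A minimal sketch of how a Beam Java pipeline can tolerate a too-fast watermark: withAllowedLateness keeps window state around so late elements are still added to their window instead of being dropped (the two-minute window and ten-minute bound are illustrative values, not from the deck):

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            // Accept data up to 10 minutes behind the watermark.
            .withAllowedLateness(Duration.standardMinutes(10)))
        .apply(Sum.integersPerKey());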
  12. • What are you computing?
      • Where in event time are results calculated?
      • When in processing time are results materialized?
      • How do refinements relate?
  13. What are you computing?
      • Element-wise: ParDo
      • Aggregating: GroupByKey, Combine
      • Composite: ParDo + Count + ParDo
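To make the three categories concrete, a minimal sketch in the Java SDK (lines is an assumed PCollection<String>; Count.perElement() is itself a composite, built from element-wise and aggregating parts):

    // Element-wise: a ParDo processes each element independently.
    PCollection<String> words = lines.apply(
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String word : c.element().split(" ")) {
              c.output(word);
            }
          }
        }));

    // Aggregating/composite: Count groups and combines under the hood.
    PCollection<KV<String, Long>> wordCounts =
        words.apply(Count.perElement());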
  14. What: Computing Integer Sums

      // Collection of raw log lines
      PCollection<String> raw = IO.read(...);

      // Element-wise transformation into team/score pairs
      PCollection<KV<String, Integer>> input =
          raw.apply(ParDo.of(new ParseFn()));

      // Composite transformation containing an aggregation
      PCollection<KV<String, Integer>> scores =
          input.apply(Sum.integersPerKey());
  15. Where in event time?
      Windowing divides data into event-time-based finite chunks. It is often required when doing aggregations over unbounded data. [Figure: fixed, sliding, and session windows laid out across keys and time.]
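The next slide shows fixed windows in code; for comparison, sliding and session windows look like this in the Java SDK (durations are illustrative):

    // Sliding windows: 2-minute windows starting every 30 seconds.
    input.apply(Window.<KV<String, Integer>>into(
        SlidingWindows.of(Duration.standardMinutes(2))
            .every(Duration.standardSeconds(30))));

    // Session windows: per-key windows closed by a 10-minute gap in events.
    input.apply(Window.<KV<String, Integer>>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));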
  16. Where: Fixed 2-minute Windows

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
          .apply(Sum.integersPerKey());
  17. When in processing time?
      • Triggers control when results are emitted.
      • Triggers are often relative to the watermark.
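The following slides use the Dataflow-model shorthand AtWatermark(). In the Beam Java SDK as released, the equivalent trigger is spelled AfterWatermark.pastEndOfWindow(); a minimal sketch:

    // Fire once when the watermark passes the end of the window.
    input.apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        // Beam requires an accumulation mode and allowed lateness
        // whenever a trigger is set explicitly.
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO));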
  18. When: Triggering at the Watermark

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());
  19. When: Early and Late Firings

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());
  20. How do refinements relate?
      • How should multiple outputs per window accumulate?
      • Should we emit the running sum, or only the values that have come in since the last result? (Accumulating & Retracting not yet implemented in Apache Beam.)
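A sketch of the discarding alternative in the Beam Java SDK: discardingFiredPanes() emits only the values that arrived since the last firing, while accumulatingFiredPanes() (next slide) re-emits the running sum each time. The early-firing trigger here uses the released Beam spelling rather than the deck's shorthand:

    // Emit only the delta since the previous firing of each window.
    input.apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.standardMinutes(10))
        .discardingFiredPanes());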
  21. How: Add Newest, Remove Previous

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingFiredPanes())
          .apply(Sum.integersPerKey());
  22. Customizing What / Where / When / How:
      1. Classic batch
      2. Batch with fixed windows
      3. Streaming
      4. Streaming with speculative + late data
      5. Streaming with accumulations
  23. The Dataflow Model & Cloud Dataflow
      • Dataflow Model & SDKs: a unified model for batch and stream processing
      • Google Cloud Dataflow: a no-ops, fully managed service
  24. The Dataflow Model & Cloud Dataflow
      • Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes
      • Google Cloud Dataflow: a great place to run Beam
  25. What is Part of Apache Beam?
      1. The Beam Model: What / Where / When / How
      2. SDKs for writing Beam pipelines, starting with Java
      3. Runners for existing distributed processing backends:
         • Apache Flink (thanks to data Artisans)
         • Apache Spark (thanks to Cloudera)
         • Google Cloud Dataflow (fully managed service)
         • Local (in-process) runner for testing
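A minimal sketch of how the runner choice is made in the Java SDK, using the in-process DirectRunner mentioned above (the pipeline body is elided):

    // Runner choice is a pipeline option; the same pipeline code can run
    // locally on the DirectRunner or on Dataflow, Flink, Spark, and so on.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(DirectRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline here ...
    p.run().waitUntilFinish();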
  26. Apache Beam Vision: Mix/Match SDKs & Runtimes
      • The Beam Model: the abstractions at the core of Apache Beam
      • Choice of SDK: users write their pipelines in a language that's familiar and integrated with their other tooling
      • Choice of runners: users choose the right runtime for their current needs, whether on-prem / cloud, open source / not, fully managed / not
      • Scalability for developers: clean APIs allow developers to contribute modules independently
      [Figure: multiple language SDKs feeding the Beam Model, which feeds multiple runners.]
  27. Apache Beam Vision: as of March 2017
      • Beam's Java SDK runs on multiple runtime environments, including:
        ◦ Apache Apex
        ◦ Apache Spark
        ◦ Apache Flink
        ◦ Google Cloud Dataflow
        ◦ [in development] Apache Gearpump
      • Cross-language infrastructure is in progress.
        ◦ Beam's Python SDK currently runs on Google Cloud Dataflow
      [Figure: Java and Python SDKs construct pipelines against the Beam Model, which Fn runners execute on Apache Spark, Apache Flink, Apache Apex, Apache Gearpump, and Cloud Dataflow.]
  28. How do you build an abstraction layer? [Figure: Apache Spark, Cloud Dataflow, and Apache Flink as backends beneath an unnamed abstraction layer, marked with question marks.]
  29. Data Processing with Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam.