2017 - Sourabh Bajaj - Big data processing with Apache Beam

PyBay
August 13, 2017

Description
In this talk, we present the new Python SDK for Apache Beam, a parallel programming model for implementing batch and streaming data processing jobs that can run on a variety of execution engines, such as Apache Spark and Google Cloud Dataflow.

Abstract
Currently, popular data processing frameworks such as Apache Spark treat batch and stream processing jobs independently. The APIs also differ across processing systems such as Apache Spark and Apache Flink, forcing end users to learn a new system for each engine they target. Apache Beam [1] addresses this problem by providing a unified programming model that can be used for both batch and streaming pipelines. The Beam SDK allows the user to execute these pipelines against different execution engines. Apache Beam currently provides Java and Python SDKs.

In the talk, we start by providing an overview of Apache Beam using the Python SDK and the problems it addresses from an end user's perspective. We cover the core programming constructs in the Beam model, such as PCollections, ParDo, GroupByKey, windowing, and triggers, and describe how these constructs make it possible to execute pipelines in a unified fashion in both batch and streaming modes. We then demonstrate these capabilities with examples that showcase Beam for stream processing and real-time data analysis, and for feature engineering in machine learning applications using TensorFlow. Finally, we end with Beam's vision of creating runner- and execution-environment-independent pipeline graphs using the Beam Fn API [2].
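For instance, a minimal word-count sketch built from these constructs might look like the following (the sample input is made up):

    import apache_beam as beam

    class ExtractWords(beam.DoFn):
        # ParDo applies a DoFn to every element of a PCollection.
        def process(self, line):
            for word in line.split():
                yield (word, 1)

    with beam.Pipeline() as p:
        counts = (
            p
            | beam.Create(['the cat', 'the dog'])        # a PCollection of lines
            | beam.ParDo(ExtractWords())                 # (word, 1) pairs
            | beam.GroupByKey()                          # word -> iterable of 1s
            | beam.Map(lambda kv: (kv[0], sum(kv[1]))))  # word -> count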

Apache Beam [1] is a top-level Apache project and is completely open source. The code for Beam can be found on GitHub [3].

[1] https://beam.apache.org/
[2] http://s.apache.org/beam-fn-api
[3] https://github.com/apache/beam

Bio
Sourabh is a software engineer at Google interested in data infrastructure and machine learning. He currently works on Apache Beam. Prior to Google, he was part of the Data Science team at Coursera, working on everything from recommendation systems to data warehousing.


Transcript

  1. GENERATIONS BEYOND MAP-REDUCE
     Clearly separates event time from processing time.
     Improved abstractions let you focus on your application logic.
     Batch and stream processing are both first-class citizens.
  2. EVENT TIME VS PROCESSING TIME
     Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."
     Watermarks are often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late.
  3. ASKING THE RIGHT QUESTIONS
     What is being computed?
     Where in event time?
     When in processing time?
     How do refinements happen?
  4. WHERE IN EVENT TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60))
         | beam.CombinePerKey(sum))
  5. WHERE IN EVENT TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60))
         | beam.CombinePerKey(sum))
     The choice of windowing is retained through subsequent aggregations.
  6. WHEN IN PROCESSING TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60),
                           triggerfn=trigger.AfterWatermark())
         | beam.CombinePerKey(sum))
  7. WHEN IN PROCESSING TIME?
     Triggers control when results are emitted.
     Triggers are often relative to the watermark.
  8. HOW DO REFINEMENTS HAPPEN?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60),
                           triggerfn=trigger.AfterWatermark(
                               early=trigger.AfterPeriod(1 * 60),
                               late=trigger.AfterCount(1)),
                           accumulation_mode=ACCUMULATING)
         | beam.CombinePerKey(sum))
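     A runnable rendering of this snippet, as a sketch against the released Python SDK (where the slide's triggerfn= and AfterPeriod correspond to the SDK's trigger= and AfterProcessingTime; the input pairs are hypothetical):

         import apache_beam as beam
         from apache_beam.transforms.window import FixedWindows
         from apache_beam.transforms.trigger import (
             AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

         with beam.Pipeline() as p:
             scores = (
                 p
                 | beam.Create([('alice', 3), ('bob', 5)])  # hypothetical (user, score) pairs
                 | beam.WindowInto(
                     FixedWindows(2 * 60),                   # 2-minute event-time windows
                     trigger=AfterWatermark(
                         early=AfterProcessingTime(1 * 60),  # speculative firing each minute
                         late=AfterCount(1)),                # re-fire for each late element
                     accumulation_mode=AccumulationMode.ACCUMULATING)
                 | beam.CombinePerKey(sum))

     With ACCUMULATING, each firing emits everything seen so far in the window; DISCARDING would emit only the delta since the previous firing.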
  9. CUSTOMIZING WHAT / WHERE / WHEN / HOW
     Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation.
     For more information see https://cloud.google.com/dataflow/examples/gaming-example
  10. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
      lines is a PCollection, a deferred collection of all lines in the specified files.
  11. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
      The "pipe" operator applies the transform on its right to a PCollection, reminiscent of bash. The lambda is applied to each line, resulting in a PCollection of words.
  12. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = (words
                    | beam.Map(lambda w: (w, 1))
                    | beam.CombinePerKey(sum))
      Operations can be chained.
  13. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
      Composite operations are easily defined.
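      The deck does not show Count's definition; a minimal sketch of how such a composite could be written, using the SDK's PTransform.expand hook:

          import apache_beam as beam

          class Count(beam.PTransform):
              # A composite transform: expand() wires existing transforms
              # into a single reusable, named unit.
              def expand(self, pcoll):
                  return (pcoll
                          | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
                          | 'SumPerKey' >> beam.CombinePerKey(sum))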
  14. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
      Finally, write the results somewhere. The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
  15. SIMPLE BATCH PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
  16. SIMPLE STREAMING PIPELINE
      with beam.Pipeline() as p:
          lines = (p
                   | beam.io.ReadPubSub(...)
                   | WindowInto(...))
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
      The same pipeline becomes a streaming one by swapping the bounded source for Pub/Sub and windowing the input.
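      For reference, a self-contained batch version runnable against the released Python SDK, where the deck's ReadTextFile/WriteTextFile shorthand maps to the SDK's beam.io.ReadFromText/WriteToText (paths are placeholders):

          import re
          import apache_beam as beam

          with beam.Pipeline() as p:
              lines = p | beam.io.ReadFromText('/path/to/files*')
              words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
              totals = (words
                        | beam.Map(lambda w: (w, 1))
                        | beam.CombinePerKey(sum))
              totals | beam.io.WriteToText('/path/to/output')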
  17. WHAT DOES APACHE BEAM PROVIDE?
      The Beam Model: What / Where / When / How.
      APIs (SDKs) for writing Beam pipelines.
      Runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Spark, Apache GearPump, Google Cloud Dataflow, and an in-process/local runner.
  18. PIPELINE SDK
      (diagram: Beam Java, Beam Python, and other languages)
      User-facing SDK; defines a language-specific API for the end user to specify the pipeline computation DAG.
  19. RUNNER API
      (diagram: Runner API beneath Beam Java, Beam Python, and other languages)
      Runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, to maintain portability across runners.
  20. SDK HARNESS
      (diagram: execution environments beneath the Runner API)
      Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.
  21. FN API
      (diagram: Fn API between the Runner API and the execution environments)
      The API through which the execution environments send and receive data, and report metrics about execution of the user code, to the runner.
  22. RUNNERS
      (diagram: Apache Apex, Apache Flink, Apache GearPump, Apache Spark, and Cloud Dataflow beneath the Fn API)
      Distributed processing environments that understand the Runner API graph and know how to execute the Beam model primitives.
  23. SOFTWARE DEVELOPMENT KITS (SDKs)
      (diagram: Language A/B/C SDKs feeding the Beam Model, which feeds Runners 1-3)
      Have a programming language you want to see in Beam? Write an SDK.
  24. RUNNERS
      (diagram: the Beam Model feeding Runners 1-3 and Google Cloud Dataflow)
      Have an execution engine you want to see in Beam? Write a runner.
  25. DOMAIN-SPECIFIC EXTENSIONS (DSLs)
      (diagram: DSLs 1-3 layered on the language SDKs)
      Have a target audience you want to see using Beam? Write a DSL.
  26. TRANSFORM LIBRARIES
      (diagram: Libraries 1-3 layered on the language SDKs)
      Have shared components that can be part of larger pipelines? Write a library.
  27. IO CONNECTORS
      (diagram: IO connectors alongside the language SDKs)
      Have a data storage or messaging system? Write an IO connector.
  28. MORE BEAM?
      Issue tracker: https://issues.apache.org/jira/projects/BEAM
      Beam website: https://beam.apache.org/
      Source code: https://github.com/apache/beam
      Developers mailing list: dev@beam.apache.org
      Users mailing list: user@beam.apache.org
      Follow @ApacheBeam on Twitter
  29. SUMMARY
      • Beam helps you tackle big data that is:
        ◦ Unbounded in volume
        ◦ Out of order
        ◦ Arbitrarily delayed
      • The Beam model separates the concerns of:
        ◦ What is being computed?
        ◦ Where in event time?
        ◦ When in processing time?
        ◦ How do refinements happen?