GENERATIONS BEYOND MAP-REDUCE
● Clearly separates event time from processing time
● Improved abstractions let you focus on your application logic
● Batch and stream processing are both first-class citizens
EVENT TIME VS PROCESSING TIME
● Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."
● Often heuristic-based.
● Watermark too slow? Results are delayed. Too fast? Some data is late.
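Watermarks advance against the event timestamps attached to elements. When a source does not attach them, they can be set explicitly; a minimal sketch, assuming records are dicts with a made-up event_time field in seconds since epoch:

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

# Attach an event-time timestamp to each record so that downstream
# windowing and watermarks operate on event time, not processing time.
# 'records' and the 'event_time' field are illustrative names.
timestamped = records | beam.Map(
    lambda r: TimestampedValue(r, r['event_time']))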
WHERE IN EVENT TIME?

scores: PCollection[KV[str, int]] = (
    input
    | beam.WindowInto(FixedWindows(2 * 60))
    | beam.CombinePerKey(sum))

The choice of windowing is retained through subsequent aggregations.
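Other windowing functions slot into the same place; a minimal sketch of the common choices, applied to the slide's input collection (sizes are illustrative, in seconds):

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, SlidingWindows, Sessions

fixed = input | 'Fixed' >> beam.WindowInto(FixedWindows(2 * 60))          # 2-minute tiles
sliding = input | 'Sliding' >> beam.WindowInto(SlidingWindows(3600, 60))  # 1-hour windows starting every minute
sessions = input | 'Sessions' >> beam.WindowInto(Sessions(10 * 60))       # close after a 10-minute gap per key

A downstream CombinePerKey(sum) then aggregates within whichever windows were chosen.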
CUSTOMIZING WHAT / WHERE / WHEN / HOW
● Classic Batch
● Windowed Batch
● Streaming
● Streaming + Accumulation
For more information see https://cloud.google.com/dataflow/examples/gaming-example
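In the Python SDK those four questions map onto the arguments of a single WindowInto; a minimal sketch of a streaming-with-accumulation configuration over the slide's input collection (window size, delays, and lateness are illustrative):

import apache_beam as beam
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows

windowed_sums = (input
    | beam.WindowInto(
        FixedWindows(2 * 60),                       # WHERE in event time
        trigger=trigger.AfterWatermark(             # WHEN in processing time
            early=trigger.AfterProcessingTime(30),  # speculative early firings
            late=trigger.AfterCount(1)),            # re-fire on each late element
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,  # HOW refinements relate
        allowed_lateness=10 * 60)
    | beam.CombinePerKey(sum))                      # WHAT is computed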
SIMPLE PIPELINE

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')

lines is a PCollection, a deferred collection of all the lines in the specified files.
SIMPLE PIPELINE

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))

The "pipe" operator applies the transform on its right to a PCollection, reminiscent of bash. Here the lambda is applied to each line, yielding a PCollection of words.
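FlatMap flattens the iterable returned by the callable, emitting zero or more elements per input, whereas beam.Map emits exactly one. A toy pipeline on the local runner makes the difference visible (the input strings are made up):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(['a b', 'c'])
     | beam.FlatMap(str.split)  # yields 'a', 'b', 'c' -- one element per word
     | beam.Map(print))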
SIMPLE PIPELINE

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | 'WriteTotals' >> beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | 'WriteLargest' >> beam.io.WriteToText('/path/to/another/output'))

Finally, write the results somewhere. The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
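Count and Largest above are the talk's shorthand; with the combiners built into the SDK the same DAG can be written end to end. A minimal runnable sketch, assuming a recent Python SDK (paths are placeholders):

import re
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    # (word, count) pairs, one per distinct word.
    totals = words | beam.combiners.Count.PerElement()
    totals | 'WriteTotals' >> beam.io.WriteToText('/path/to/output')
    # Second branch of the DAG: the 100 most frequent words by count.
    (totals
     | beam.combiners.Top.Of(100, key=lambda kv: kv[1])
     | 'WriteTop' >> beam.io.WriteToText('/path/to/another/output'))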
WHAT DOES APACHE BEAM PROVIDE?
● The Beam Model: What / Where / When / How
● API (SDKs) for writing Beam pipelines
● Runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Spark, Apache GearPump, Google Cloud Dataflow, and an InProcess/Local runner
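The same pipeline code targets any of these backends by switching the runner in its options; a minimal sketch using the runner names registered in the Python SDK, where the local in-process runner is called DirectRunner:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap 'DirectRunner' for 'FlinkRunner', 'SparkRunner', or 'DataflowRunner'
# without changing the pipeline itself.
options = PipelineOptions(runner='DirectRunner')
with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)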
Pipeline SDK (Beam Java, Beam Python, other languages): the user-facing SDK; defines a language-specific API for the end user to specify the pipeline computation DAG.
Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, maintaining portability across runners.
SDK Harness: Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.
Fn API: the API through which the execution environments exchange data with the runner and report metrics about execution of the user code.
Runner (Apache Apex, Apache Flink, Apache Spark, Apache GearPump, Cloud Dataflow): distributed processing environments that understand the Runner API graph and know how to execute the Beam model primitives.
SUMMARY
● Beam helps you tackle big data that is:
  ○ Unbounded in volume
  ○ Out of order
  ○ Arbitrarily delayed
● The Beam model separates concerns of:
  ○ What is being computed?
  ○ Where in event time?
  ○ When in processing time?
  ○ How do refinements happen?