2017 - Sourabh Bajaj - Big data processing with Apache Beam

PyBay
August 13, 2017

Description
In this talk, we present the new Python SDK for Apache Beam, a parallel programming model for implementing batch and streaming data processing jobs that can run on a variety of execution engines, such as Apache Spark and Google Cloud Dataflow.

Abstract
Currently, popular data processing frameworks such as Apache Spark treat batch and stream processing jobs independently. The APIs also differ across processing systems such as Apache Spark and Apache Flink, forcing end users to learn a new system for each engine they target. Apache Beam [1] addresses this problem by providing a unified programming model that can be used for both batch and streaming pipelines. The Beam SDK allows the user to execute these pipelines against different execution engines. Apache Beam currently provides Java and Python SDKs.

In the talk, we start by providing an overview of Apache Beam using the Python SDK and the problems it addresses from an end user's perspective. We cover the core programming constructs in the Beam model, such as PCollections, ParDo, GroupByKey, windowing, and triggers, and describe how these constructs make it possible to execute pipelines in a unified fashion in both batch and streaming modes. We then demonstrate these capabilities with examples that showcase Beam for stream processing and real-time data analysis, and for feature engineering in machine learning applications using TensorFlow. Finally, we end with Beam's vision of creating runner- and execution-environment-independent pipeline graphs using the Beam Fn API [2].
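For instance, a minimal word-count sketch built from these constructs might look like the following (the sample input is made up):

    import apache_beam as beam

    class ExtractWords(beam.DoFn):
        # ParDo applies a DoFn to every element of a PCollection.
        def process(self, line):
            for word in line.split():
                yield (word, 1)

    with beam.Pipeline() as p:
        counts = (
            p
            | beam.Create(['the cat', 'the dog'])        # a PCollection of lines
            | beam.ParDo(ExtractWords())                 # (word, 1) pairs
            | beam.GroupByKey()                          # word -> iterable of 1s
            | beam.Map(lambda kv: (kv[0], sum(kv[1]))))  # word -> count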

Apache Beam [1] is a top-level Apache project and is completely open source. The code for Beam can be found on GitHub [3].

[1] https://beam.apache.org/
[2] http://s.apache.org/beam-fn-api
[3] https://github.com/apache/beam

Bio
Sourabh is a software engineer at Google interested in data infrastructure and machine learning. He currently works on Apache Beam. Prior to Google, he was part of the Data Science team at Coursera, working on everything from recommendation systems to data warehousing.


Transcript

  1. GENERATIONS BEYOND MAP-REDUCE
     Clearly separates event time from processing time.
     Improved abstractions let you focus on your application logic.
     Batch and stream processing are both first-class citizens.
  2. EVENT TIME VS PROCESSING TIME
     Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."
     Watermarks are often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late.
  3. ASKING THE RIGHT QUESTIONS
     What is being computed?
     Where in event time?
     When in processing time?
     How do refinements happen?
  4. WHERE IN EVENT TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60))
         | beam.CombinePerKey(sum))
  5. WHERE IN EVENT TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60))
         | beam.CombinePerKey(sum))
     The choice of windowing is retained through subsequent aggregations.
  6. WHEN IN PROCESSING TIME?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60),
                           triggerfn=trigger.AfterWatermark())
         | beam.CombinePerKey(sum))
  7. WHEN IN PROCESSING TIME?
     Triggers control when results are emitted.
     Triggers are often relative to the watermark.
  8. HOW DO REFINEMENTS HAPPEN?
     scores: PCollection[KV[str, int]] = (
         input
         | beam.WindowInto(FixedWindows(2 * 60),
                           triggerfn=trigger.AfterWatermark(
                               early=trigger.AfterPeriod(1 * 60),
                               late=trigger.AfterCount(1)),
                           accumulation_mode=ACCUMULATING)
         | beam.CombinePerKey(sum))
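     A runnable rendering of this snippet, as a sketch against the released Python SDK (where the slide's triggerfn= and AfterPeriod correspond to the SDK's trigger= and AfterProcessingTime; the input pairs are hypothetical):

         import apache_beam as beam
         from apache_beam.transforms.window import FixedWindows
         from apache_beam.transforms.trigger import (
             AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

         with beam.Pipeline() as p:
             scores = (
                 p
                 | beam.Create([('alice', 3), ('bob', 5)])  # hypothetical (user, score) pairs
                 | beam.WindowInto(
                     FixedWindows(2 * 60),                   # 2-minute event-time windows
                     trigger=AfterWatermark(
                         early=AfterProcessingTime(1 * 60),  # speculative firing each minute
                         late=AfterCount(1)),                # re-fire for each late element
                     accumulation_mode=AccumulationMode.ACCUMULATING)
                 | beam.CombinePerKey(sum))

     With ACCUMULATING, each firing emits everything seen so far in the window; DISCARDING would emit only the delta since the previous firing.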
  9. CUSTOMIZING WHAT / WHERE / WHEN / HOW
     Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation.
     For more information see https://cloud.google.com/dataflow/examples/gaming-example
  10. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
      lines is a PCollection, a deferred collection of all lines in the specified files.
  11. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
      The "pipe" operator applies the transform on its right to a PCollection, reminiscent of bash. The lambda is applied to each line, resulting in a PCollection of words.
  12. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = (words
                    | beam.Map(lambda w: (w, 1))
                    | beam.CombinePerKey(sum))
      Operations can be chained.
  13. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
      Composite operations are easily defined.
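      The deck does not show Count's definition; a minimal sketch of how such a composite could be written, using the SDK's PTransform.expand hook:

          import apache_beam as beam

          class Count(beam.PTransform):
              # A composite transform: expand() wires existing transforms
              # into a single reusable, named unit.
              def expand(self, pcoll):
                  return (pcoll
                          | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
                          | 'SumPerKey' >> beam.CombinePerKey(sum))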
  14. SIMPLE PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
      Finally, write the results somewhere. The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
  15. SIMPLE BATCH PIPELINE
      with beam.Pipeline() as p:
          lines = p | beam.io.ReadTextFile('/path/to/files')
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
  16. SIMPLE STREAMING PIPELINE
      with beam.Pipeline() as p:
          lines = (p
                   | beam.io.ReadPubSub(...)
                   | WindowInto(...))
          words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
          totals = words | Count()
          totals | beam.io.WriteTextFile('/path/to/output')
          (totals
           | beam.CombinePerKey(Largest(100))
           | beam.io.WriteTextFile('/path/to/another/output'))
      The same pipeline becomes a streaming one by swapping the bounded source for Pub/Sub and windowing the input.
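      For reference, a self-contained batch version runnable against the released Python SDK, where the deck's ReadTextFile/WriteTextFile shorthand maps to the SDK's beam.io.ReadFromText/WriteToText (paths are placeholders):

          import re
          import apache_beam as beam

          with beam.Pipeline() as p:
              lines = p | beam.io.ReadFromText('/path/to/files*')
              words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
              totals = (words
                        | beam.Map(lambda w: (w, 1))
                        | beam.CombinePerKey(sum))
              totals | beam.io.WriteToText('/path/to/output')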
  17. WHAT DOES APACHE BEAM PROVIDE?
      The Beam Model: What / Where / When / How.
      APIs (SDKs) for writing Beam pipelines.
      Runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Spark, Apache GearPump, Google Cloud Dataflow, and an in-process/local runner.
  18. PIPELINE SDK
      (diagram: Beam Java, Beam Python, and other languages)
      User-facing SDK; defines a language-specific API for the end user to specify the pipeline computation DAG.
  19. RUNNER API
      (diagram: Runner API beneath Beam Java, Beam Python, and other languages)
      Runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, to maintain portability across runners.
  20. SDK HARNESS
      (diagram: execution environments beneath the Runner API)
      Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.
  21. FN API
      (diagram: Fn API between the Runner API and the execution environments)
      The API through which the execution environments send and receive data, and report metrics about execution of the user code, to the runner.
  22. RUNNERS
      (diagram: Apache Apex, Apache Flink, Apache GearPump, Apache Spark, and Cloud Dataflow beneath the Fn API)
      Distributed processing environments that understand the Runner API graph and know how to execute the Beam model primitives.
  23. SOFTWARE DEVELOPMENT KITS (SDKs)
      (diagram: Language A/B/C SDKs feeding the Beam Model, which feeds Runners 1-3)
      Have a programming language you want to see in Beam? Write an SDK.
  24. RUNNERS
      (diagram: the Beam Model feeding Runners 1-3 and Google Cloud Dataflow)
      Have an execution engine you want to see in Beam? Write a runner.
  25. DOMAIN-SPECIFIC EXTENSIONS (DSLs)
      (diagram: DSLs 1-3 layered on the language SDKs)
      Have a target audience you want to see using Beam? Write a DSL.
  26. TRANSFORM LIBRARIES
      (diagram: Libraries 1-3 layered on the language SDKs)
      Have shared components that can be part of larger pipelines? Write a library.
  27. IO CONNECTORS
      (diagram: IO connectors alongside the language SDKs)
      Have a data storage or messaging system? Write an IO connector.
  28. MORE BEAM?
      Issue tracker: https://issues.apache.org/jira/projects/BEAM
      Beam website: https://beam.apache.org/
      Source code: https://github.com/apache/beam
      Developers mailing list: dev@beam.apache.org
      Users mailing list: user@beam.apache.org
      Follow @ApacheBeam on Twitter
  29. SUMMARY
      • Beam helps you tackle big data that is:
        ◦ Unbounded in volume
        ◦ Out of order
        ◦ Arbitrarily delayed
      • The Beam model separates the concerns of:
        ◦ What is being computed?
        ◦ Where in event time?
        ◦ When in processing time?
        ◦ How do refinements happen?