Slide 1

Unified Batch and Stream Processing with Apache Beam, PyBay 2017

Slide 2

Hello! I am Sourabh

Slide 3

Hello! I am Sourabh. I am a Software Engineer.

Slide 4

Hello! I am Sourabh. I am a Software Engineer. I tweet at @sb2nov.

Slide 5

What is Apache Beam?

Slide 6

Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines

Slide 7

Big Data

Slide 8

LAUNCH!! (image: https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg)

Slide 9

DATA CAN BE BIG

Slide 10

… REALLY BIG … (chart: data volume by day, Tuesday through Thursday)

Slide 11

UNBOUNDED, DELAYED, OUT OF ORDER (diagram: events stamped 8:00 arriving at processing times from 9:00 to 14:00)

Slide 12

ORGANIZING THE STREAM (diagram: the 8:00 events grouped together)

Slide 13

DATA PROCESSING TRADEOFFS: Completeness, Latency, Cost ($$$)

Slide 14

WHAT IS IMPORTANT? (diagram: Completeness, Low Latency, and Low Cost ($$$), each rated Important or Not Important)

Slide 15

MONTHLY BILLING (the same tradeoff diagram, rated for monthly billing)

Slide 16

BILLING ESTIMATE (the same tradeoff diagram, rated for a billing estimate)

Slide 17

FRAUD DETECTION (the same tradeoff diagram, rated for fraud detection)

Slide 18

Beam Model

Slide 19

GENERATIONS BEYOND MAP-REDUCE
● Clearly separates event time from processing time
● Improved abstractions let you focus on your application logic
● Batch and stream processing are both first-class citizens

Slide 20

Pipeline, PTransform, PCollection (bounded or unbounded)
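
A minimal sketch of how these three primitives relate in the Python SDK (a hypothetical example, assuming only that the apache-beam package is installed):

    import apache_beam as beam

    with beam.Pipeline() as p:                        # Pipeline: the DAG of computation
        nums = p | beam.Create([1, 2, 3])             # PCollection: a deferred dataset
        doubled = nums | beam.Map(lambda x: x * 2)    # PTransform: an operation producing a new PCollection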

Slide 21

EVENT TIME VS PROCESSING TIME

Slide 22

EVENT TIME VS PROCESSING TIME

Slide 23

EVENT TIME VS PROCESSING TIME Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too Slow? Results are delayed. Too Fast? Some data is late.

Slide 24

ASKING THE RIGHT QUESTIONS What is being computed? Where in event time? When in processing time? How do refinements happen?

Slide 25

WHAT IS BEING COMPUTED? scores: PCollection[KV[str, int]] = (input | beam.CombinePerKey(sum))
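
As a runnable rendering of this snippet (a minimal sketch with inlined data, assuming the apache-beam package; the slide's input and type annotation are pseudocode):

    import apache_beam as beam

    with beam.Pipeline() as p:
        scores = (
            p
            | beam.Create([('alice', 3), ('bob', 5), ('alice', 7)])  # (key, value) pairs
            | beam.CombinePerKey(sum))   # per-key aggregation: alice -> 10, bob -> 5
        scores | beam.Map(print)         # deferred; prints when the pipeline executes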

Slide 26

WHAT IS BEING COMPUTED? Element-Wise, Aggregating, Composite

Slide 27

WHAT IS BEING COMPUTED?

Slide 28

WHERE IN EVENT TIME? scores: PCollection[KV[str, int]] = (input | beam.WindowInto(FixedWindows(2 * 60)) | beam.CombinePerKey(sum))

Slide 29

WHERE IN EVENT TIME?

Slide 30

WHERE IN EVENT TIME?

Slide 31

WHERE IN EVENT TIME? scores: PCollection[KV[str, int]] = (input | beam.WindowInto(FixedWindows(2 * 60)) | beam.CombinePerKey(sum)) The choice of windowing is retained through subsequent aggregations.
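
A hedged, runnable sketch of the same idea in the current Python SDK (TimestampedValue attaches illustrative event times to inlined data; FixedWindows takes a size in seconds):

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        scores = (
            p
            | beam.Create([('alice', 3), ('bob', 5), ('alice', 7)])
            | beam.Map(lambda kv: window.TimestampedValue(kv, 30))  # illustrative event times
            | beam.WindowInto(window.FixedWindows(2 * 60))          # 2-minute fixed windows
            | beam.CombinePerKey(sum))                              # summed per key, per window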

Slide 32

WHEN IN PROCESSING TIME? scores: PCollection[KV[str, int]] = (input | beam.WindowInto(FixedWindows(2 * 60), trigger=trigger.AfterWatermark()) | beam.CombinePerKey(sum))

Slide 33

WHEN IN PROCESSING TIME? Triggers control when results are emitted. Triggers are often relative to the watermark.
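
In the Python SDK the trigger is passed to WindowInto via its trigger argument, and an accumulation_mode must accompany it. A hedged sketch of constructing such a transform:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms import trigger

    windowed = beam.WindowInto(
        window.FixedWindows(2 * 60),
        trigger=trigger.AfterWatermark(),  # fire when the watermark passes the end of the window
        accumulation_mode=trigger.AccumulationMode.DISCARDING)  # required whenever a trigger is set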

Slide 34

WHEN IN PROCESSING TIME?

Slide 35

HOW DO REFINEMENTS HAPPEN? scores: PCollection[KV[str, int]] = (input | beam.WindowInto(FixedWindows(2 * 60), trigger=trigger.AfterWatermark( early=trigger.AfterProcessingTime(1 * 60), late=trigger.AfterCount(1)), accumulation_mode=AccumulationMode.ACCUMULATING) | beam.CombinePerKey(sum))
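
The same configuration as a runnable sketch (hedged: inlined data, current SDK names rather than the slide's pseudocode):

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms import trigger

    with beam.Pipeline() as p:
        scores = (
            p
            | beam.Create([('alice', 3), ('bob', 5)])
            | beam.WindowInto(
                window.FixedWindows(2 * 60),
                trigger=trigger.AfterWatermark(
                    early=trigger.AfterProcessingTime(1 * 60),  # speculative firing after a minute
                    late=trigger.AfterCount(1)),                # re-fire for each late element
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING)  # refinements include prior data
            | beam.CombinePerKey(sum))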

Slide 36

HOW DO REFINEMENTS HAPPEN?

Slide 37

CUSTOMIZING WHAT / WHERE / WHEN / HOW: Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation. For more information see https://cloud.google.com/dataflow/examples/gaming-example

Slide 38

Python SDK

Slide 39

SIMPLE PIPELINE
with beam.Pipeline() as p:
Pipeline construction is deferred.

Slide 40

SIMPLE PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
lines is a PCollection, a deferred collection of all lines in the specified files.

Slide 41

SIMPLE PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
The "pipe" operator applies a transformation (on the right) to a PCollection, reminiscent of bash. The lambda is applied to each line, resulting in a PCollection of words.

Slide 42

SIMPLE PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = (words
              | beam.Map(lambda w: (w, 1))
              | beam.CombinePerKey(sum))
Operations can be chained.

Slide 43

SIMPLE PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
Composite operations are easily defined.

Slide 44

SIMPLE PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))
Finally, write the results somewhere. The pipeline actually executes on exiting its context. Pipelines are DAGs in general.

Slide 45

SIMPLE BATCH PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))
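
For reference, a self-contained rendering of this batch pipeline that should run as-is (hedged: Count() and Largest(100) are slide shorthands, spelled out here with the SDK's built-in combiners; the writes are labeled so the two WriteToText applications get unique names):

    import re

    import apache_beam as beam

    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = words | beam.combiners.Count.PerElement()           # (word, count) pairs
        totals | 'WriteAll' >> beam.io.WriteToText('/path/to/output')
        (totals
         | beam.combiners.Top.Of(100, key=lambda kv: kv[1])          # 100 most frequent words
         | 'WriteTop' >> beam.io.WriteToText('/path/to/another/output'))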

Slide 46

WHAT ABOUT STREAMING?

Slide 47

SIMPLE STREAMING PIPELINE
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromPubSub(...) | beam.WindowInto(...)
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))
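
A hedged sketch of what the elided source might look like: unbounded reads require streaming mode in the pipeline options, and the Pub/Sub topic below is a hypothetical placeholder (reading from Pub/Sub also assumes the GCP extras of apache-beam are installed):

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # unbounded sources need streaming mode
    with beam.Pipeline(options=options) as p:
        lines = (
            p
            | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')  # hypothetical topic
            | beam.Map(lambda data: data.decode('utf-8'))      # Pub/Sub delivers bytes
            | beam.WindowInto(window.FixedWindows(2 * 60)))    # 2-minute windows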

Slide 48

Demo

Slide 49

WORD COUNT (image: http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg)

Slide 50

TRENDING ON TWITTER (image: http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png)

Slide 51

Portability & Vision Google Cloud Dataflow

Slide 52

WHAT DOES APACHE BEAM PROVIDE?
● The Beam Model: What / Where / When / How
● API (SDKs) for writing Beam pipelines
● Runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Spark, Apache GearPump, Google Cloud Dataflow, InProcess / Local

Slide 53

Pipeline SDK (Beam Java, Beam Python, other languages): the user-facing SDK that defines a language-specific API for the end user to specify the pipeline computation DAG.

Slide 54

Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, to maintain portability across runners.

Slide 55

SDK Harness: Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.

Slide 56

Fn API: the API through which the execution environments exchange data with the Runner and report metrics about the execution of the user code.

Slide 57

Runner (Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump, Apache Apex): distributed processing environments that understand the Runner API graph and know how to execute the Beam model primitives.

Slide 58

BEAM RUNNER CAPABILITIES https://beam.apache.org/capability-matrix/

Slide 59

HOW CAN YOU HELP?

Slide 60

Software Development Kits (SDKs) (diagram: The Beam Model, Language A/B/C SDKs, Runners 1–3) Have a programming language you want to see in Beam? Write an SDK.

Slide 61

Runners (diagram: The Beam Model, Language A/B/C SDKs, Runners 1–3 including Google Cloud Dataflow) Have an execution engine you want to see in Beam? Write a runner.

Slide 62

Domain-Specific Extensions (DSLs) (diagram: DSLs 1–3 with the Language A/B/C SDKs and The Beam Model) Have a target audience you want to see using Beam? Write a DSL.

Slide 63

Transform Libraries (diagram: Libraries 1–3 with the Language A/B/C SDKs and The Beam Model) Have shared components that can be part of larger pipelines? Write a library.

Slide 64

IO Connectors (diagram: IO connectors with the Language A/B/C SDKs and The Beam Model) Have a data storage or messaging system? Write an IO connector.

Slide 65

MORE BEAM?
● Issue tracker (https://issues.apache.org/jira/projects/BEAM)
● Beam website (https://beam.apache.org/)
● Source code (https://github.com/apache/beam)
● Developers mailing list ([email protected])
● Users mailing list ([email protected])
● Follow @ApacheBeam on Twitter

Slide 66

SUMMARY
● Beam helps you tackle big data that is:
○ Unbounded in volume
○ Out of order
○ Arbitrarily delayed
● The Beam model separates concerns of:
○ What is being computed?
○ Where in event time?
○ When in processing time?
○ How do refinements happen?

Slide 67

Thanks! You can find me at: @sb2nov Questions?