Slide 1

Data Processing with Apache Beam Pydata Seattle 2017

Slide 2

Hello! I am Sourabh

Slide 3

Hello! I am Sourabh. I am a Software Engineer.

Slide 4

Hello! I am Sourabh. I am a Software Engineer. I tweet at @sb2nov.

Slide 5

Hello! I am Sourabh. I am a Software Engineer. I tweet at @sb2nov. I like Ice Cream.

Slide 6

What is Apache Beam?

Slide 7

Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines

Slide 8

Big Data

Slide 9

LAUNCH!! (image: https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg)

Slide 10

DATA CAN BE BIG

Slide 11

… REALLY BIG … (chart: data volume by day, Tuesday through Thursday)

Slide 12

UNBOUNDED, DELAYED, OUT OF ORDER (timeline: events stamped 8:00 through 14:00 arriving out of order)

Slide 13

ORGANIZING THE STREAM (timeline: the events stamped 8:00 collected together)

Slide 14

DATA PROCESSING TRADEOFFS: Completeness, Latency, Cost ($$$)

Slide 15

WHAT IS IMPORTANT? (chart: Completeness, Low Latency, Low Cost ($$$), each rated Important or Not Important)

Slide 16

MONTHLY BILLING (chart: Completeness, Low Latency, Low Cost rated Important or Not Important for this use case)

Slide 17

BILLING ESTIMATE (chart: Completeness, Low Latency, Low Cost rated Important or Not Important for this use case)

Slide 18

FRAUD DETECTION (chart: Completeness, Low Latency, Low Cost rated Important or Not Important for this use case)

Slide 19

Beam Model

Slide 20

Pipeline, PTransform, PCollection (bounded or unbounded)

Slide 21

EVENT TIME VS PROCESSING TIME

Slide 22

ASKING THE RIGHT QUESTIONS: What is being computed? Where in event time? When in processing time? How do refinements happen?

Slide 23

WHAT IS BEING COMPUTED?

scores: PCollection[KV[str, int]] = (input
    | Sum.integersPerKey())
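
The slide's snippet is Beam model pseudocode (a Java-style `Sum.integersPerKey()` with a Python type annotation). What it computes, stripped of Beam, is an ordinary per-key sum; a plain-Python sketch:

```python
from collections import defaultdict

def sum_integers_per_key(pairs):
    """Plain-Python equivalent of the per-key sum: group (key, value)
    pairs by key and add up the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Per-user game scores, summed per user over the whole (bounded) input.
scores = sum_integers_per_key([("amy", 3), ("bob", 5), ("amy", 4)])
# scores == {"amy": 7, "bob": 5}
```

On a bounded input this is the entire answer; the remaining three questions only matter once the input is unbounded.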

Slide 24

WHAT IS BEING COMPUTED?

Slide 25

WHERE IN EVENT TIME?

scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60))
    | Sum.integersPerKey())
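
Windowing subdivides the collection by event time. For the fixed two-minute windows above, the assignment arithmetic can be sketched in plain Python (the `fixed_window` helper is illustrative, not a Beam API):

```python
def fixed_window(event_ts, size=2 * 60):
    """Assign an event timestamp (in seconds) to its fixed window,
    returned as a (start, end) pair. Windows tile the timeline:
    [0, size), [size, 2 * size), ..."""
    start = event_ts - (event_ts % size)
    return (start, start + size)

# Two events seconds apart can still land in different windows:
fixed_window(119)  # -> (0, 120)
fixed_window(121)  # -> (120, 240)
```

The per-key sum then runs independently inside each window rather than over the whole stream.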

Slide 26

WHERE IN EVENT TIME?

Slide 27

WHERE IN EVENT TIME?

Slide 28

WHEN IN PROCESSING TIME?

scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60)
        .triggering(AtWatermark()))
    | Sum.integersPerKey())

Slide 29

WHEN IN PROCESSING TIME?

Slide 30

HOW DO REFINEMENTS HAPPEN?

scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60)
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(1 * 60))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    | Sum.integersPerKey())
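
Accumulating fired panes means each successive firing of a window reports the total over everything seen so far, while discarding panes report only what arrived since the last firing. A small plain-Python simulation (not Beam code) of the difference:

```python
def fire_panes(values_per_firing, accumulating=True):
    """Simulate one window firing several times (early, on time, late).
    With accumulating panes each firing reports the running sum over
    everything seen so far; with discarding panes each firing reports
    only the values that arrived since the previous firing."""
    panes, running = [], 0
    for new_values in values_per_firing:
        if accumulating:
            running += sum(new_values)
            panes.append(running)
        else:
            panes.append(sum(new_values))
    return panes

firings = [[3], [5, 2], [1]]             # early, on-time, and late arrivals
fire_panes(firings, accumulating=True)   # -> [3, 10, 11]
fire_panes(firings, accumulating=False)  # -> [3, 7, 1]
```

Downstream consumers that overwrite a result want accumulating panes; consumers that add deltas want discarding panes.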

Slide 31

HOW DO REFINEMENTS HAPPEN?

Slide 32

CUSTOMIZING WHAT / WHERE / WHEN / HOW: Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation. For more information see https://cloud.google.com/dataflow/examples/gaming-example

Slide 33

Examples

Slide 34

WORD COUNT (image: http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg)

Slide 35

WORD COUNT

import apache_beam as beam, re

Slide 36

WORD COUNT

import apache_beam as beam, re

with beam.Pipeline() as p:
    (p
     | beam.io.textio.ReadFromText("input.txt"))

Slide 37

WORD COUNT

import apache_beam as beam, re

with beam.Pipeline() as p:
    (p
     | beam.io.textio.ReadFromText("input.txt")
     | beam.FlatMap(lambda s: re.split(r"\W+", s)))

Slide 38

WORD COUNT

import apache_beam as beam, re

with beam.Pipeline() as p:
    (p
     | beam.io.textio.ReadFromText("input.txt")
     | beam.FlatMap(lambda s: re.split(r"\W+", s))
     | beam.combiners.Count.PerElement())

Slide 39

WORD COUNT

import apache_beam as beam, re

with beam.Pipeline() as p:
    (p
     | beam.io.textio.ReadFromText("input.txt")
     | beam.FlatMap(lambda s: re.split(r"\W+", s))
     | beam.combiners.Count.PerElement()
     | beam.Map(lambda wc: "%s: %d" % wc))

Slide 40

WORD COUNT

import apache_beam as beam, re

with beam.Pipeline() as p:
    (p
     | beam.io.textio.ReadFromText("input.txt")
     | beam.FlatMap(lambda s: re.split(r"\W+", s))
     | beam.combiners.Count.PerElement()
     | beam.Map(lambda wc: "%s: %d" % wc)
     | beam.io.textio.WriteToText("output/stringcounts"))
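
Setting aside Beam's distributed execution, the pipeline above computes the classic word count. A plain-Python rendering of the same steps (split, count per element, format each pair), with the empty tokens that `re.split` can produce filtered out:

```python
import re
from collections import Counter

def word_count(lines):
    """Plain-Python equivalent of the word-count pipeline:
    FlatMap(split) -> Count.PerElement -> Map(format)."""
    words = (w for line in lines for w in re.split(r"\W+", line) if w)
    return ["%s: %d" % wc for wc in Counter(words).items()]

word_count(["the quick fox", "the lazy dog"])
# contains "the: 2", "quick: 1", "fox: 1", ...
```

What Beam adds on top of this is exactly the parallelism and the windowing/triggering machinery from the model slides.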

Slide 41

TRENDING ON TWITTER (image: http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png)

Slide 42

TRENDING ON TWITTER

with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic"))

Slide 43

TRENDING ON TWITTER

with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic")
     | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60)))
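
With `SlidingWindows(5 * 60, 1 * 60)`, a new five-minute window starts every minute, so each tweet belongs to five overlapping windows. The assignment can be sketched in plain Python (`sliding_windows` is an illustrative helper, not the Beam API):

```python
def sliding_windows(event_ts, size=5 * 60, period=1 * 60):
    """List every sliding window (start, end) containing event_ts.
    A window starts at each multiple of `period` and spans `size`
    seconds, so each event falls into size // period windows here."""
    last_start = event_ts - (event_ts % period)
    starts = range(last_start - size + period, last_start + 1, period)
    return [(s, s + size) for s in starts]

sliding_windows(130)  # 5 windows, from (-120, 180) up to (120, 420)
```

The downstream per-window hashtag counts therefore refresh every minute while still reflecting a five-minute view.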

Slide 44

TRENDING ON TWITTER

with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic")
     | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
     | beam.ParDo(ParseHashTagDoFn()))

Slide 45

TRENDING ON TWITTER

with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic")
     | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
     | beam.ParDo(ParseHashTagDoFn())
     | beam.combiners.Count.PerElement())

Slide 46

TRENDING ON TWITTER

with beam.Pipeline() as p:
    (p
     | beam.io.ReadStringsFromPubSub("twitter_topic")
     | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
     | beam.ParDo(ParseHashTagDoFn())
     | beam.combiners.Count.PerElement()
     | beam.ParDo(BigQueryOutputFormatDoFn())
     | beam.io.WriteToBigQuery("trends_table"))
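
`ParseHashTagDoFn` is never shown on the slides; the core logic such a DoFn might wrap could look like the following plain function (the regex and the lowercasing are assumptions for illustration):

```python
import re

def extract_hashtags(tweet):
    """Hypothetical core of ParseHashTagDoFn: yield every #hashtag
    in a tweet, lowercased so counting is case-insensitive."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

extract_hashtags("Loving #PyData in #Seattle! #pydata")
# -> ["pydata", "seattle", "pydata"]
```

Inside the pipeline, a DoFn would yield these tags one at a time so that `Count.PerElement` can tally them per sliding window.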

Slide 47

Portability & Vision Google Cloud Dataflow

Slide 48

Pipeline SDK (Beam Java, Beam Python, Other Languages): user-facing SDK that defines a language-specific API for the end user to specify the pipeline computation DAG.

Slide 49

Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, to maintain portability across runners.

Slide 50

SDK Harness: Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.

Slide 51

Fn API: the API the execution environments use to send and receive data, and to report metrics around execution of the user code, to and from the Runner.

Slide 52

Runner (Apache Flink, Apache Spark, Apache Apex, Apache Gearpump, Cloud Dataflow): distributed processing environments that understand the Runner API graph and how to execute the Beam model primitives.

Slide 53

More Beam?
Issue tracker: https://issues.apache.org/jira/projects/BEAM
Beam website: https://beam.apache.org/
Source code: https://github.com/apache/beam
Developers mailing list: [email protected]
Users mailing list: [email protected]

Slide 54

Thanks! You can find me at: @sb2nov Questions?