Big data processing with Apache Beam

In this talk, we present the new Python SDK for Apache Beam - a parallel programming model that allows one to implement batch and streaming data processing jobs that can run on a variety of execution engines like Apache Spark and Google Cloud Dataflow. We will use examples to discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing.

Sourabh

July 06, 2017

Transcript

  1. Data Processing with Apache Beam
    PyData Seattle 2017

  2. I am Sourabh
    Hello!

  3. I am Sourabh
    Hello!
    I am a Software Engineer

  4. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov

  5. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov
    I like Ice Cream

  6. What is Apache Beam?

  7. Apache Beam is a unified programming model for expressing
    efficient and portable data processing pipelines.

  8. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
    LAUNCH!!

  9. DATA CAN BE BIG

  10. … REALLY BIG ...
    (image: data accumulating over Tuesday, Wednesday, Thursday)

  11. UNBOUNDED, DELAYED, OUT OF ORDER
    (diagram: events with an 8:00 event time arriving anywhere between 8:00
    and 14:00 in processing time)

  12. ORGANIZING THE STREAM
    (diagram: the 8:00 events grouped back together despite arriving late)

  13. DATA PROCESSING TRADEOFFS
    Completeness, Latency, Cost ($$$)

  14. WHAT IS IMPORTANT?
    (chart: each use case rates Completeness, Low Latency, and Low Cost as
    Important or Not Important)

  15. MONTHLY BILLING
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  16. BILLING ESTIMATE
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  17. FRAUD DETECTION
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  18. Pipeline, PTransform, PCollection (bounded or unbounded)
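
    A minimal Python SDK sketch of how these pieces fit together (the element
    values are illustrative, not from the slides):

    import apache_beam as beam

    # A Pipeline holds the whole computation DAG.
    with beam.Pipeline() as p:
        lengths = (p
            | beam.Create(["hello", "beam"])  # PTransform yielding a bounded PCollection
            | beam.Map(len))                  # another PTransform; produces a new PCollection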

  19. EVENT TIME VS PROCESSING TIME

  20. ASKING THE RIGHT QUESTIONS
    What is being computed?
    Where in event time?
    When in processing time?
    How do refinements happen?

  21. WHAT IS BEING COMPUTED?
    scores: PCollection[KV[str, int]] = (input
        | Sum.integersPerKey())
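
    A rough Python SDK equivalent of this step (a sketch; scores_input stands
    in for the slide's input PCollection of (team, score) pairs):

    import apache_beam as beam

    # Sum the integer values per key.
    scores = scores_input | beam.CombinePerKey(sum)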

  22. WHAT IS BEING COMPUTED?

  23. WHERE IN EVENT TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60))
        | Sum.integersPerKey())
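
    The same windowing step as a Python SDK sketch (reusing the hypothetical
    scores_input from above):

    import apache_beam as beam
    from apache_beam.transforms import window

    scores = (scores_input
        | beam.WindowInto(window.FixedWindows(2 * 60))  # two-minute event-time windows
        | beam.CombinePerKey(sum))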

  24. WHERE IN EVENT TIME?

  25. WHERE IN EVENT TIME?

  26. WHEN IN PROCESSING TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60)
            .triggering(AtWatermark()))
        | Sum.integersPerKey())
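
    In the Python SDK, triggering is passed to WindowInto as keyword arguments;
    a sketch of the same watermark trigger (an accumulation mode must be given
    whenever a trigger is set):

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    scores = (scores_input
        | beam.WindowInto(
            window.FixedWindows(2 * 60),
            trigger=trigger.AfterWatermark(),  # fire when the watermark passes the window end
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum))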

  27. WHEN IN PROCESSING TIME?

  28. HOW DO REFINEMENTS HAPPEN?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60)
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(1 * 60))
                .withLateFirings(AtCount(1)))
            .accumulatingFiredPanes())
        | Sum.integersPerKey())
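
    A Python SDK sketch of the same early/late firing and accumulation setup:

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    scores = (scores_input
        | beam.WindowInto(
            window.FixedWindows(2 * 60),
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(1 * 60),  # speculative firing every minute
                late=trigger.AfterCount(1)),                # refire for each late element
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum))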

  29. HOW DO REFINEMENTS HAPPEN?

  30. CUSTOMIZING WHAT WHERE WHEN HOW
    Classic Batch | Windowed Batch | Streaming | Streaming + Accumulation
    For more information see https://cloud.google.com/dataflow/examples/gaming-example

  31. WORD COUNT
    http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg

  32. WORD COUNT
    import apache_beam as beam, re

  33. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt"))

  34. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s)))

  35. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement())

  36. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc))

  37. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc)
         | beam.io.textio.WriteToText("output/stringcounts"))
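
    Assembled as a single runnable script (a sketch that assumes a local
    input.txt; the beam.Filter step is an addition not on the slides, since
    re.split emits empty strings at line boundaries and punctuation):

    import re
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.Filter(lambda w: w)                    # drop empty tokens
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc)
         | beam.io.textio.WriteToText("output/stringcounts"))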

  38. TRENDING ON TWITTER
    http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png

  39. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic"))

  40. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60)))

  41. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn()))

  42. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement())

  43. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement()
         | beam.ParDo(BigQueryOutputFormatDoFn())
         | beam.io.WriteToBigQuery("trends_table"))
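
    The two DoFns referenced above are not shown on the slides; a possible
    implementation (an assumption, not the speaker's code) could look like:

    import re
    import apache_beam as beam

    class ParseHashTagDoFn(beam.DoFn):
        def process(self, tweet):
            # Emit every #hashtag found in the tweet text.
            for tag in re.findall(r"#\w+", tweet):
                yield tag

    class BigQueryOutputFormatDoFn(beam.DoFn):
        def process(self, tag_count):
            tag, count = tag_count
            # Shape each (hashtag, count) pair as a row dict for WriteToBigQuery.
            yield {"hashtag": tag, "count": count}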

  44. Portability & Vision
    Google Cloud Dataflow

  45. Pipeline SDK (Beam Java, Beam Python, other languages)
    User-facing SDK; defines a language-specific API for the end user to
    specify the pipeline computation DAG.

  46. Runner API
    Runner- and language-agnostic representation of the user's pipeline graph.
    It contains only nodes of Beam model primitives that all runners
    understand, to maintain portability across runners.

  47. SDK Harness
    Docker-based execution environments, shared by all runners, for running
    the user code in a consistent environment.

  48. Fn API
    API that the execution environments use to send and receive data and to
    report metrics about execution of the user code to the Runner.

  49. Runner
    Distributed processing environments (Apache Flink, Apache Spark, Cloud
    Dataflow, Apache Gearpump, Apache Apex) that understand the Runner API
    graph and how to execute the Beam model primitives.

  50. More Beam?
    Issue tracker (https://issues.apache.org/jira/projects/BEAM)
    Beam website (https://beam.apache.org/)
    Source code (https://github.com/apache/beam)
    Developers mailing list ([email protected])
    Users mailing list ([email protected])

  51. Thanks!
    You can find me at: @sb2nov
    Questions?
