
Big data processing with Apache Beam

In this talk, we present the new Python SDK for Apache Beam - a parallel programming model that allows one to implement batch and streaming data processing jobs that can run on a variety of execution engines like Apache Spark and Google Cloud Dataflow. We will use examples to discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing.

Sourabh

July 06, 2017

Transcript

  1. I am Sourabh: Hello! I am a Software Engineer. I tweet at @sb2nov. I like Ice Cream.
  2. ASKING THE RIGHT QUESTIONS: What is being computed? Where in event time? When in processing time? How do refinements happen?
  3. WHERE IN EVENT TIME?
     scores: PCollection[KV[str, int]] = (input
         | beam.WindowInto(FixedWindows(2 * 60))
         | Sum.integersPerKey())
     (See the Python SDK windowing sketch after the transcript.)
  4. WHEN IN PROCESSING TIME?
     scores: PCollection[KV[str, int]] = (input
         | beam.WindowInto(FixedWindows(2 * 60)
             .triggering(AtWatermark()))
         | Sum.integersPerKey())
     (See the trigger sketch after the transcript.)
  5. HOW DO REFINEMENTS HAPPEN?
     scores: PCollection[KV[str, int]] = (input
         | beam.WindowInto(FixedWindows(2 * 60)
             .triggering(AtWatermark()
                 .withEarlyFirings(AtPeriod(1 * 60))
                 .withLateFirings(AtCount(1)))
             .accumulatingFiredPanes())
         | Sum.integersPerKey())
     (See the early/late firing sketch after the transcript.)
  6. CUSTOMIZING WHAT / WHERE / WHEN / HOW: Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation. For more information see https://cloud.google.com/dataflow/examples/gaming-example
  7. WORD COUNT
     import apache_beam as beam, re
     with beam.Pipeline() as p:
         (p | beam.io.textio.ReadFromText("input.txt"))
  8. WORD COUNT
     import apache_beam as beam, re
     with beam.Pipeline() as p:
         (p | beam.io.textio.ReadFromText("input.txt")
            | beam.FlatMap(lambda s: re.split("\\W+", s)))
  9. WORD COUNT
     import apache_beam as beam, re
     with beam.Pipeline() as p:
         (p | beam.io.textio.ReadFromText("input.txt")
            | beam.FlatMap(lambda s: re.split("\\W+", s))
            | beam.combiners.Count.PerElement())
  10. WORD COUNT
      import apache_beam as beam, re
      with beam.Pipeline() as p:
          (p | beam.io.textio.ReadFromText("input.txt")
             | beam.FlatMap(lambda s: re.split("\\W+", s))
             | beam.combiners.Count.PerElement()
             | beam.Map(lambda (w, c): "%s: %d" % (w, c)))
  11. WORD COUNT
      import apache_beam as beam, re
      with beam.Pipeline() as p:
          (p | beam.io.textio.ReadFromText("input.txt")
             | beam.FlatMap(lambda s: re.split("\\W+", s))
             | beam.combiners.Count.PerElement()
             | beam.Map(lambda (w, c): "%s: %d" % (w, c))
             | beam.io.textio.WriteToText("output/stringcounts"))
      (A Python 3 compatible version appears after the transcript.)
  12. TRENDING ON TWITTER
      with beam.Pipeline() as p:
          (p | beam.io.ReadStringsFromPubSub("twitter_topic")
             | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
             | beam.ParDo(ParseHashTagDoFn()))
  13. TRENDING ON TWITTER
      with beam.Pipeline() as p:
          (p | beam.io.ReadStringsFromPubSub("twitter_topic")
             | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
             | beam.ParDo(ParseHashTagDoFn())
             | beam.combiners.Count.PerElement())
  14. TRENDING ON TWITTER
      with beam.Pipeline() as p:
          (p | beam.io.ReadStringsFromPubSub("twitter_topic")
             | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
             | beam.ParDo(ParseHashTagDoFn())
             | beam.combiners.Count.PerElement()
             | beam.ParDo(BigQueryOutputFormatDoFn())
             | beam.io.WriteToBigQuery("trends_table"))
      (A fuller sketch with the DoFns appears after the transcript.)
  15. Pipeline SDK (Beam Java, Beam Python, other languages): the user-facing SDK; it defines a language-specific API for the end user to specify the pipeline computation DAG.
  16. Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes for Beam model primitives that all runners understand, to maintain portability across runners.
  17. SDK Harness: Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.
  18. Fn API: the API that the execution environments use to send and receive data and to report metrics about execution of the user code to the Runner.
  19. Runner (Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump, Apache Apex): distributed processing environments that understand the Runner API graph and know how to execute the Beam model primitives. (A runner-selection sketch appears after the transcript.)
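
The snippet on slide 3 is written in the Java-flavoured pseudocode of the Beam model talks (PCollection[KV[...]], Sum.integersPerKey). A minimal sketch of the same step with the actual Beam Python SDK, assuming input_pcoll is a timestamped PCollection of (key, int) pairs:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows

    # Assign each element to a 2-minute fixed window in event time,
    # then sum the integer values per key within each window.
    scores = (input_pcoll
              | beam.WindowInto(FixedWindows(2 * 60))
              | beam.CombinePerKey(sum))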
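
For slide 4, the Python SDK expresses the watermark trigger as keyword arguments to WindowInto; a sketch under the same assumptions (an accumulation mode must be given whenever a trigger is set):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

    # Emit one result per window when the watermark passes the end of the window.
    scores = (input_pcoll
              | beam.WindowInto(FixedWindows(2 * 60),
                                trigger=AfterWatermark(),
                                accumulation_mode=AccumulationMode.DISCARDING)
              | beam.CombinePerKey(sum))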
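
For slide 5, early and late firings plus the accumulation mode map onto AfterWatermark(early=..., late=...) and AccumulationMode.ACCUMULATING; a sketch (an allowed_lateness may also be needed so late data is not dropped):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import (
        AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode)

    # Speculative results every minute before the watermark, one firing per
    # late element after it, and panes that accumulate earlier values.
    scores = (input_pcoll
              | beam.WindowInto(FixedWindows(2 * 60),
                                trigger=AfterWatermark(
                                    early=AfterProcessingTime(1 * 60),
                                    late=AfterCount(1)),
                                accumulation_mode=AccumulationMode.ACCUMULATING)
              | beam.CombinePerKey(sum))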
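
The word count built up on slides 7-11 uses a tuple-parameter lambda, which is Python 2 syntax. A Python 3 compatible sketch of the complete pipeline (the file paths are the placeholders from the slides):

    import re
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText("input.txt")
         | beam.FlatMap(lambda line: re.split(r"\W+", line))
         | beam.Filter(bool)  # drop the empty strings that re.split can produce
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda word_count: "%s: %d" % word_count)
         | beam.io.WriteToText("output/stringcounts"))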
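
The streaming example on slides 12-14 leaves the two DoFns undefined; here is a sketch of what they might look like. ParseHashTagDoFn and BigQueryOutputFormatDoFn are hypothetical implementations, and the Pub/Sub topic path, the streaming pipeline options, and the BigQuery table spec/schema are placeholders:

    import re
    import apache_beam as beam
    from apache_beam.transforms.window import SlidingWindows

    class ParseHashTagDoFn(beam.DoFn):
        # Hypothetical: emit each hashtag found in a tweet.
        def process(self, tweet):
            for tag in re.findall(r"#\w+", tweet):
                yield tag

    class BigQueryOutputFormatDoFn(beam.DoFn):
        # Hypothetical: turn a (hashtag, count) pair into a BigQuery row dict.
        def process(self, tag_count):
            tag, count = tag_count
            yield {"hashtag": tag, "count": count}

    with beam.Pipeline() as p:  # streaming options omitted
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement()
         | beam.ParDo(BigQueryOutputFormatDoFn())
         | beam.io.WriteToBigQuery("trends_table"))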
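
Slides 15-19 describe the portability stack; the practical upshot is that the same pipeline graph can be handed to different runners. A minimal sketch of runner selection, assuming the local DirectRunner (swapping in e.g. "DataflowRunner" with its project and staging options, or a portable Flink/Spark runner, leaves the pipeline code unchanged):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner is just a pipeline option; the DAG construction below
    # does not change when a different runner is chosen.
    options = PipelineOptions(["--runner=DirectRunner"])
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["hello", "beam"])
         | beam.Map(print))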