Big data processing with Apache Beam

In this talk, we present the new Python SDK for Apache Beam - a parallel programming model that allows one to implement batch and streaming data processing jobs that can run on a variety of execution engines like Apache Spark and Google Cloud Dataflow. We will use examples to discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing.

Sourabh

July 06, 2017

Transcript

  1. Data Processing with Apache Beam
    PyData Seattle 2017

  2. I am Sourabh
    Hello!

  3. I am Sourabh
    Hello!
    I am a Software Engineer

  4. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov

  5. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov
    I like Ice Cream

  6. What is Apache Beam?

  7. Apache Beam is a unified programming model for expressing
    efficient and portable data processing pipelines.

  8. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
    LAUNCH!!

  9. DATA CAN BE BIG

  10. … REALLY BIG ...
    (image: data accumulating over Tuesday, Wednesday, Thursday)

  11. UNBOUNDED, DELAYED, OUT OF ORDER
    (diagram: events with an 8:00 event time arriving anywhere between 8:00
    and 14:00 in processing time)

  12. ORGANIZING THE STREAM
    (diagram: the 8:00 events grouped back together despite arriving late)

  13. DATA PROCESSING TRADEOFFS
    Completeness, Latency, Cost ($$$)

  14. WHAT IS IMPORTANT?
    (chart: each use case rates Completeness, Low Latency, and Low Cost as
    Important or Not Important)

  15. MONTHLY BILLING
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  16. BILLING ESTIMATE
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  17. FRAUD DETECTION
    (chart: how this use case rates Completeness, Low Latency, and Low Cost)

  18. Pipeline, PTransform, PCollection (bounded or unbounded)
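
    A minimal Python SDK sketch of how these pieces fit together (the element
    values are illustrative, not from the slides):

    import apache_beam as beam

    # A Pipeline holds the whole computation DAG.
    with beam.Pipeline() as p:
        lengths = (p
            | beam.Create(["hello", "beam"])  # PTransform yielding a bounded PCollection
            | beam.Map(len))                  # another PTransform; produces a new PCollection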

  19. EVENT TIME VS PROCESSING TIME

  20. ASKING THE RIGHT QUESTIONS
    What is being computed?
    Where in event time?
    When in processing time?
    How do refinements happen?

  21. WHAT IS BEING COMPUTED?
    scores: PCollection[KV[str, int]] = (input
        | Sum.integersPerKey())
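
    A rough Python SDK equivalent of this step (a sketch; scores_input stands
    in for the slide's input PCollection of (team, score) pairs):

    import apache_beam as beam

    # Sum the integer values per key.
    scores = scores_input | beam.CombinePerKey(sum)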

  22. WHAT IS BEING COMPUTED?

  23. WHERE IN EVENT TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60))
        | Sum.integersPerKey())
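
    The same windowing step as a Python SDK sketch (reusing the hypothetical
    scores_input from above):

    import apache_beam as beam
    from apache_beam.transforms import window

    scores = (scores_input
        | beam.WindowInto(window.FixedWindows(2 * 60))  # two-minute event-time windows
        | beam.CombinePerKey(sum))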

  24. WHERE IN EVENT TIME?

  25. WHERE IN EVENT TIME?

  26. WHEN IN PROCESSING TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60)
            .triggering(AtWatermark()))
        | Sum.integersPerKey())
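
    In the Python SDK, triggering is passed to WindowInto as keyword arguments;
    a sketch of the same watermark trigger (an accumulation mode must be given
    whenever a trigger is set):

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    scores = (scores_input
        | beam.WindowInto(
            window.FixedWindows(2 * 60),
            trigger=trigger.AfterWatermark(),  # fire when the watermark passes the window end
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum))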

  27. WHEN IN PROCESSING TIME?

  28. HOW DO REFINEMENTS HAPPEN?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60)
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(1 * 60))
                .withLateFirings(AtCount(1)))
            .accumulatingFiredPanes())
        | Sum.integersPerKey())
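
    A Python SDK sketch of the same early/late firing and accumulation setup:

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    scores = (scores_input
        | beam.WindowInto(
            window.FixedWindows(2 * 60),
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(1 * 60),  # speculative firing every minute
                late=trigger.AfterCount(1)),                # refire for each late element
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum))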

  29. HOW DO REFINEMENTS HAPPEN?

  30. CUSTOMIZING WHAT WHERE WHEN HOW
    Classic Batch | Windowed Batch | Streaming | Streaming + Accumulation
    For more information see https://cloud.google.com/dataflow/examples/gaming-example

  31. WORD COUNT
    http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg

  32. WORD COUNT
    import apache_beam as beam, re

  33. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt"))

  34. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s)))

  35. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement())

  36. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc))

  37. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc)
         | beam.io.textio.WriteToText("output/stringcounts"))
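
    Assembled as a single runnable script (a sketch that assumes a local
    input.txt; the beam.Filter step is an addition not on the slides, since
    re.split emits empty strings at line boundaries and punctuation):

    import re
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.io.textio.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split("\\W+", s))
         | beam.Filter(lambda w: w)                    # drop empty tokens
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda wc: "%s: %d" % wc)
         | beam.io.textio.WriteToText("output/stringcounts"))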

  38. TRENDING ON TWITTER
    http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png

  39. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic"))

  40. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60)))

  41. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn()))

  42. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement())

  43. TRENDING ON TWITTER
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5*60, 1*60))
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement()
         | beam.ParDo(BigQueryOutputFormatDoFn())
         | beam.io.WriteToBigQuery("trends_table"))
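
    The two DoFns referenced above are not shown on the slides; a possible
    implementation (an assumption, not the speaker's code) could look like:

    import re
    import apache_beam as beam

    class ParseHashTagDoFn(beam.DoFn):
        def process(self, tweet):
            # Emit every #hashtag found in the tweet text.
            for tag in re.findall(r"#\w+", tweet):
                yield tag

    class BigQueryOutputFormatDoFn(beam.DoFn):
        def process(self, tag_count):
            tag, count = tag_count
            # Shape each (hashtag, count) pair as a row dict for WriteToBigQuery.
            yield {"hashtag": tag, "count": count}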

  44. Portability & Vision
    Google Cloud Dataflow

  45. Pipeline SDK (Beam Java, Beam Python, other languages)
    User-facing SDK; defines a language-specific API for the end user to
    specify the pipeline computation DAG.

  46. Runner API
    Runner- and language-agnostic representation of the user's pipeline graph.
    It contains only nodes of Beam model primitives that all runners
    understand, to maintain portability across runners.

  47. SDK Harness
    Docker-based execution environments, shared by all runners, for running
    the user code in a consistent environment.

  48. Fn API
    API that the execution environments use to send and receive data and to
    report metrics about execution of the user code to the Runner.

  49. Runner
    Distributed processing environments (Apache Flink, Apache Spark, Cloud
    Dataflow, Apache Gearpump, Apache Apex) that understand the Runner API
    graph and how to execute the Beam model primitives.

  50. More Beam?
    Issue tracker (https://issues.apache.org/jira/projects/BEAM)
    Beam website (https://beam.apache.org/)
    Source code (https://github.com/apache/beam)
    Developers mailing list ([email protected])
    Users mailing list ([email protected])

  51. Thanks!
    You can find me at: @sb2nov
    Questions?
