
Big data processing with Apache Beam

In this talk, we present the new Python SDK for Apache Beam - a parallel programming model that allows one to implement batch and streaming data processing jobs that can run on a variety of execution engines like Apache Spark and Google Cloud Dataflow. We will use examples to discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing.

Sourabh

July 06, 2017

Transcript

  1. Data Processing with
    Apache Beam
    PyData Seattle 2017


  2. I am Sourabh
    Hello!


  3. I am Sourabh
    Hello!
    I am a Software Engineer


  4. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov


  5. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov
    I like Ice Cream


  6. What is Apache Beam?


  7. Apache Beam is a unified programming model for expressing efficient and
    portable data processing pipelines


  8. Big Data


  9. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
    LAUNCH!!


  10. DATA CAN BE BIG


  11. … REALLY BIG ... (chart: Tuesday, Wednesday, Thursday)


  12. UNBOUNDED, DELAYED, OUT OF ORDER
    (diagram: elements with an 8:00 event time arriving between 8:00 and 14:00)


  13. ORGANIZING THE STREAM (diagram: the 8:00 elements grouped together)


  14. DATA PROCESSING TRADEOFFS: Completeness, Latency, Cost ($$$)


  15. WHAT IS IMPORTANT?
    (matrix rating Completeness, Low Latency, and Low Cost as Important or Not Important)


  16. MONTHLY BILLING (same Important / Not Important matrix)


  17. BILLING ESTIMATE (same Important / Not Important matrix)


  18. FRAUD DETECTION (same Important / Not Important matrix)


  19. Beam Model


  20. Pipeline, PTransform, PCollection (bounded or unbounded)

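A minimal sketch of how these three pieces fit together in the Python SDK, with made-up element values: the Pipeline holds the computation DAG, each pipe (|) applies a PTransform, and every transform produces a new PCollection.

    import apache_beam as beam

    with beam.Pipeline() as p:                      # Pipeline: container for the DAG
        words = p | beam.Create(["a", "b", "a"])    # a bounded PCollection
        counts = (words                             # each PTransform yields a new PCollection
                  | beam.Map(lambda w: (w, 1))
                  | beam.CombinePerKey(sum))
        counts | beam.Map(print)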

  21. EVENT TIME VS PROCESSING TIME


  22. ASKING THE RIGHT QUESTIONS
    What is being computed?
    Where in event time?
    When in processing time?
    How do refinements happen?


  23. WHAT IS BEING COMPUTED?
    scores: PCollection[KV[str, int]] = (input
    | Sum.integersPerKey())

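The slide uses the Beam model's Java-flavored shorthand (Sum.integersPerKey). In the Python SDK the same "what" step is roughly the following sketch; scores_input is a stand-in name for a PCollection of (user, score) pairs, not something defined on the slides.

    import apache_beam as beam

    # Sum the integer scores per key: the Python counterpart of Sum.integersPerKey()
    scores = scores_input | beam.CombinePerKey(sum)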

  24. WHAT IS BEING COMPUTED?


  25. WHERE IN EVENT TIME?
    scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60))
    | Sum.integersPerKey())

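In the Python SDK the "where" step becomes a window assignment before the combine; again a sketch over the assumed scores_input collection.

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows

    scores = (scores_input
              | beam.WindowInto(FixedWindows(2 * 60))   # 2-minute fixed windows in event time
              | beam.CombinePerKey(sum))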

  26. WHERE IN EVENT TIME?


  27. WHERE IN EVENT TIME?


  28. WHEN IN PROCESSING TIME?
    scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60)
    .triggering(AtWatermark()))
    | Sum.integersPerKey())

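The Python SDK expresses triggers as keyword arguments to WindowInto rather than a fluent .triggering() call, and it expects an accumulation_mode whenever a trigger is given. A sketch under the same scores_input assumption:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

    scores = (scores_input
              | beam.WindowInto(FixedWindows(2 * 60),
                                # fire once the watermark passes the end of the window
                                trigger=AfterWatermark(),
                                accumulation_mode=AccumulationMode.DISCARDING)
              | beam.CombinePerKey(sum))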

  29. WHEN IN PROCESSING TIME?


  30. HOW DO REFINEMENTS HAPPEN?
    scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60)
    .triggering(AtWatermark()
    .withEarlyFirings(AtPeriod(1 * 60))
    .withLateFirings(AtCount(1)))
    .accumulatingFiredPanes())
    | Sum.integersPerKey())

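Early and late firings and the accumulation mode also move into WindowInto keyword arguments in the Python SDK. A sketch of the slide's pipeline over the assumed scores_input collection; AfterProcessingTime stands in for the slide's AtPeriod.

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

    scores = (scores_input
              | beam.WindowInto(
                    FixedWindows(2 * 60),
                    trigger=AfterWatermark(
                        early=AfterProcessingTime(1 * 60),   # speculative early results
                        late=AfterCount(1)),                 # refine on each late element
                    accumulation_mode=AccumulationMode.ACCUMULATING)
              | beam.CombinePerKey(sum))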

  31. HOW DO REFINEMENTS HAPPEN?


  32. CUSTOMIZING WHAT WHERE WHEN HOW
    Classic Batch, Windowed Batch, Streaming, Streaming + Accumulation
    For more information see https://cloud.google.com/dataflow/examples/gaming-example


  33. Examples


  34. WORD COUNT
    http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg


  35. WORD COUNT
    import apache_beam as beam, re


  36. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
    (p
    | beam.io.textio.ReadFromText("input.txt"))


  37. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
    (p
    | beam.io.textio.ReadFromText("input.txt")
    | beam.FlatMap(lambda s: re.split("\\W+", s)))


  38. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
    (p
    | beam.io.textio.ReadFromText("input.txt")
    | beam.FlatMap(lambda s: re.split("\\W+", s))
    | beam.combiners.Count.PerElement())


  39. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
    (p
    | beam.io.textio.ReadFromText("input.txt")
    | beam.FlatMap(lambda s: re.split("\\W+", s))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda (w, c): "%s: %d" % (w, c)))


  40. WORD COUNT
    import apache_beam as beam, re
    with beam.Pipeline() as p:
    (p
    | beam.io.textio.ReadFromText("input.txt")
    | beam.FlatMap(lambda s: re.split("\\W+", s))
    | beam.combiners.Count.PerElement()
    | beam.Map(lambda (w, c): "%s: %d" % (w, c))
    | beam.io.textio.WriteToText("output/stringcounts"))

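A version of the finished word count that also runs on Python 3 (the slide's tuple-unpacking lambda is Python 2 only); input.txt and the output prefix are placeholders.

    import re
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText("input.txt")
         | beam.FlatMap(lambda s: re.split(r"\W+", s))
         | beam.Filter(lambda w: w)                    # drop empty strings left by the split
         | beam.combiners.Count.PerElement()
         | beam.Map(lambda kv: "%s: %d" % kv)          # kv is a (word, count) tuple
         | beam.io.WriteToText("output/stringcounts"))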

  41. TRENDING ON TWITTER
    http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png


  42. TRENDING ON TWITTER
    with beam.Pipeline() as p:
    (p
    | beam.io.ReadStringsFromPubSub("twitter_topic"))


  43. TRENDING ON TWITTER
    with beam.Pipeline() as p:
    (p
    | beam.io.ReadStringsFromPubSub("twitter_topic")
    | beam.WindowInto(SlidingWindows(5*60, 1*60)))


  44. TRENDING ON TWITTER
    with beam.Pipeline() as p:
    (p
    | beam.io.ReadStringsFromPubSub("twitter_topic")
    | beam.WindowInto(SlidingWindows(5*60, 1*60))
    | beam.ParDo(ParseHashTagDoFn()))


  45. TRENDING ON TWITTER
    with beam.Pipeline() as p:
    (p
    | beam.io.ReadStringsFromPubSub("twitter_topic")
    | beam.WindowInto(SlidingWindows(5*60, 1*60))
    | beam.ParDo(ParseHashTagDoFn())
    | beam.combiners.Count.PerElement())


  46. TRENDING ON TWITTER
    with beam.Pipeline() as p:
    (p
    | beam.io.ReadStringsFromPubSub("twitter_topic")
    | beam.WindowInto(SlidingWindows(5*60, 1*60))
    | beam.ParDo(ParseHashTagDoFn())
    | beam.combiners.Count.PerElement()
    | beam.ParDo(BigQueryOutputFormatDoFn())
    | beam.io.WriteToBigQuery("trends_table"))

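A fuller sketch of the trending pipeline. ParseHashTagDoFn and BigQueryOutputFormatDoFn are never shown on the slides, so their bodies below, along with the topic, table, and schema names, are illustrative guesses rather than the speaker's code.

    import re
    import apache_beam as beam
    from apache_beam.transforms.window import SlidingWindows

    class ParseHashTagDoFn(beam.DoFn):
        def process(self, tweet):
            # emit every #hashtag found in the tweet text
            for tag in re.findall(r"#\w+", tweet):
                yield tag

    class BigQueryOutputFormatDoFn(beam.DoFn):
        def process(self, tag_count):
            tag, count = tag_count
            yield {"hashtag": tag, "count": count}

    with beam.Pipeline() as p:
        (p
         # as on the slide; newer SDKs spell this beam.io.ReadFromPubSub
         | beam.io.ReadStringsFromPubSub("twitter_topic")
         | beam.WindowInto(SlidingWindows(5 * 60, 1 * 60))  # 5-minute windows, every minute
         | beam.ParDo(ParseHashTagDoFn())
         | beam.combiners.Count.PerElement()
         | beam.ParDo(BigQueryOutputFormatDoFn())
         | beam.io.WriteToBigQuery("trends_table",
                                   schema="hashtag:STRING,count:INTEGER"))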

  47. Portability & Vision
    Google Cloud Dataflow


  48. Pipeline SDK (Beam Java, Beam Python, other languages)
    User-facing SDK; defines a language-specific API for the end user to
    specify the pipeline computation DAG.


  49. Runner API (Beam Java, Beam Python, other languages)
    Runner- and language-agnostic representation of the user's pipeline
    graph. It contains only nodes for Beam model primitives that all
    runners understand, which keeps pipelines portable across runners.


  50. SDK Harness (Beam Java, Beam Python, other languages, each with an Execution environment)
    Docker-based execution environments that are shared by all runners
    for running the user code in a consistent environment.


  51. Fn API
    API which the execution environments use to send and receive data
    and to report metrics about execution of the user code to the Runner.


  52. Runner (Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump, Apache Apex)
    Distributed processing environments that understand the Runner API
    graph and how to execute the Beam model primitives.

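The same pipeline code can target any of these runners by swapping pipeline options; a minimal sketch, where the Dataflow project and temp_location values are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Run locally on the DirectRunner ...
    local_opts = PipelineOptions(["--runner=DirectRunner"])

    # ... or hand the same graph to a distributed runner such as Cloud Dataflow.
    dataflow_opts = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--temp_location=gs://my-bucket/tmp",
    ])

    with beam.Pipeline(options=local_opts) as p:
        p | beam.Create(["hello", "beam"]) | beam.Map(print)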

  53. More Beam?
    Issue tracker (https://issues.apache.org/jira/projects/BEAM)
    Beam website (https://beam.apache.org/)
    Source code (https://github.com/apache/beam)
    Developers mailing list ([email protected])
    Users mailing list ([email protected])


  54. Thanks!
    You can find me at: @sb2nov
    Questions?
