
Unified Batch and Stream Processing with Apache Beam

Sourabh
August 13, 2017

Talk at PyBay 2017

Transcript

  1. Unified Batch and Stream
    Processing with
    Apache Beam
    PyBay 2017

  2. I am Sourabh
    Hello!

  3. I am Sourabh
    Hello!
    I am a Software Engineer

  4. I am Sourabh
    Hello!
    I am a Software Engineer
    I tweet at @sb2nov

  5. What is Apache Beam?

  6. Apache Beam is a unified
    programming model for
    expressing efficient and
    portable data processing
    pipelines

  7. Big Data

  8. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
    LAUNCH!!

  9. DATA CAN BE BIG

  10. … REALLY BIG ...
    Tuesday
    Wednesday
    Thursday

  11. UNBOUNDED, DELAYED, OUT OF ORDER
    [Diagram: a stream spanning 8:00-14:00 in which several events stamped 8:00 arrive hours late, out of order]

  12. ORGANIZING THE STREAM
    [Diagram: the late 8:00 events grouped back into their 8:00 event-time window]

  13. DATA PROCESSING TRADEOFFS
    Completeness, Latency, Cost ($$$)

  14. WHAT IS IMPORTANT?
    [Chart: rating Completeness, Low Latency, and Low Cost as Important or Not Important for a use case]

  15. MONTHLY BILLING
    [Chart: Completeness / Low Latency / Low Cost, rated for monthly billing]

  16. BILLING ESTIMATE
    [Chart: Completeness / Low Latency / Low Cost, rated for billing estimates]

  17. FRAUD DETECTION
    [Chart: Completeness / Low Latency / Low Cost, rated for fraud detection]

  18. Beam
    Model

  19. GENERATIONS BEYOND MAP-REDUCE
    Clearly separates event time from
    processing time
    Improved abstractions let you focus
    on your application logic
    Batch and stream processing are both
    first-class citizens

  20. Pipeline
    PTransform
    PCollection
    (bounded or
    unbounded)

  21. EVENT TIME VS PROCESSING TIME

  22. EVENT TIME VS PROCESSING TIME

  23. EVENT TIME VS PROCESSING TIME
    Watermarks describe event time
    progress.
    "No timestamp earlier than the
    watermark will be seen"
    Often heuristic-based.
    Too Slow? Results are delayed.
    Too Fast? Some data is late.
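A watermark that runs too fast makes delayed data late; that rule can be sketched in a few lines of plain Python (illustrative timestamps in minutes, not Beam's APIs):

```python
# An event is "late" if its event-time timestamp is earlier than the
# watermark the pipeline has already advanced past.
def classify(event_times, watermark):
    """Split event timestamps into on-time and late lists."""
    on_time = [t for t in event_times if t >= watermark]
    late = [t for t in event_times if t < watermark]
    return on_time, late

# Watermark says "no timestamp earlier than 9:00 will be seen",
# but a delayed 8:00 event arrives anyway, so it is late.
on_time, late = classify([9 * 60, 10 * 60, 8 * 60], watermark=9 * 60)
```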

  24. ASKING THE RIGHT QUESTIONS
    When in processing time?
    What is being computed?
    Where in event time?
    How do refinements happen?

  25. WHAT IS BEING COMPUTED?
    scores: PCollection[KV[str, int]] = (input
        | beam.CombinePerKey(sum))
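`CombinePerKey(sum)` groups the input by key and combines each group's values with `sum`; its semantics can be mimicked in plain Python (a sketch, not Beam's implementation):

```python
from collections import defaultdict

def combine_per_key(pairs, combine_fn):
    """Group (key, value) pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: combine_fn(values) for key, values in grouped.items()}

# Per-player score totals, as on the slide.
scores = combine_per_key([('alice', 3), ('bob', 5), ('alice', 4)], sum)
```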

  26. WHAT IS BEING COMPUTED?
    Element-Wise Aggregating Composite

  27. WHAT IS BEING COMPUTED?

  28. WHERE IN EVENT TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60))
        | beam.CombinePerKey(sum))
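`FixedWindows(2 * 60)` assigns each element to a two-minute window based on its event-time timestamp; the assignment arithmetic is simple modular math (a plain-Python sketch):

```python
def fixed_window(timestamp, size=2 * 60):
    """Return the [start, end) fixed window a timestamp falls into."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

# Events at 8:00:30 and 8:01:59 share a window; 8:02:10 does not.
w1 = fixed_window(8 * 3600 + 30)
w2 = fixed_window(8 * 3600 + 119)
w3 = fixed_window(8 * 3600 + 130)
```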

  29. WHERE IN EVENT TIME?

  30. WHERE IN EVENT TIME?

  31. WHERE IN EVENT TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60))
        | beam.CombinePerKey(sum))
    The choice of windowing is retained through subsequent aggregations.

  32. WHEN IN PROCESSING TIME?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60),
                          trigger=trigger.AfterWatermark())
        | beam.CombinePerKey(sum))

  33. WHEN IN PROCESSING TIME?
    Triggers control when results are
    emitted.
    Triggers are often relative to the
    watermark.

  34. WHEN IN PROCESSING TIME?

  35. HOW DO REFINEMENTS HAPPEN?
    scores: PCollection[KV[str, int]] = (input
        | beam.WindowInto(FixedWindows(2 * 60),
                          trigger=trigger.AfterWatermark(
                              early=trigger.AfterProcessingTime(1 * 60),
                              late=trigger.AfterCount(1)),
                          accumulation_mode=AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum))
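Accumulating mode makes each successive trigger firing for a window emit a refinement of everything seen so far, while discarding mode emits only the new data. A plain-Python sketch of the difference (not Beam's trigger machinery):

```python
def panes(batches, accumulating=True):
    """Emit one result per trigger firing over successive batches of values."""
    results, running = [], 0
    for batch in batches:
        if accumulating:
            running += sum(batch)       # each pane covers all data so far
            results.append(running)
        else:
            results.append(sum(batch))  # each pane covers only new data
    return results

# Three trigger firings (early, on-time, late) over one window's data.
acc = panes([[3], [4], [5]], accumulating=True)
disc = panes([[3], [4], [5]], accumulating=False)
```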

  36. HOW DO REFINEMENTS HAPPEN?

  37. CUSTOMIZING WHAT WHERE WHEN HOW
    Classic
    Batch
    Windowed
    Batch
    Streaming Streaming +
    Accumulation
    For more information see https://cloud.google.com/dataflow/examples/gaming-example

  38. Python SDK

  39. SIMPLE PIPELINE
    with beam.Pipeline() as p:
    Pipeline construction is deferred.

  40. SIMPLE PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
    lines is a PCollection, a deferred collection
    of all lines in the specified files.

  41. SIMPLE PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    The "pipe" operator applies the transformation on its
    right to a PCollection, reminiscent of bash. The lambda
    is applied to each line, resulting in a PCollection of words.
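The pipe syntax works because Beam's types overload Python's `|` operator; here is a toy sketch of that mechanism, with hypothetical `Collection` and `FlatMap` stand-ins (not Beam's real classes, which stay deferred rather than holding concrete values):

```python
import re

class Collection:
    """Toy stand-in for a PCollection, holding concrete values."""
    def __init__(self, items):
        self.items = list(items)
    def __or__(self, transform):
        # `coll | transform` dispatches to the transform on the right.
        return transform.apply(self)

class FlatMap:
    """Toy stand-in for beam.FlatMap: one input, many outputs."""
    def __init__(self, fn):
        self.fn = fn
    def apply(self, coll):
        return Collection(x for item in coll.items for x in self.fn(item))

words = Collection(['hello world']) | FlatMap(lambda line: re.findall(r'\w+', line))
```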

  42. SIMPLE PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = (words
                  | beam.Map(lambda w: (w, 1))
                  | beam.CombinePerKey(sum))
    Operations can be chained.
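On a bounded input, the chained stage above is the classic word count; the same logic in plain Python (no Beam) makes the expected output easy to check:

```python
import re
from collections import Counter

def count_words(lines):
    """Plain-Python equivalent of FlatMap(findall) | Map(w -> (w, 1)) | CombinePerKey(sum)."""
    words = [w for line in lines for w in re.findall(r'\w+', line)]
    return dict(Counter(words))

totals = count_words(['to be or', 'not to be'])
```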

  43. SIMPLE PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = words | Count()
    Composite operations are easily defined.

  44. SIMPLE PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = words | Count()
        totals | beam.io.WriteToText('/path/to/output')
        (totals | beam.CombinePerKey(Largest(100))
                | beam.io.WriteToText('/path/to/another/output'))
    Finally, write the results somewhere.
    The pipeline actually executes on exiting its context.
    Pipelines are DAGs in general.

  45. SIMPLE BATCH PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromText('/path/to/files')
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = words | Count()
        totals | beam.io.WriteToText('/path/to/output')
        (totals | beam.CombinePerKey(Largest(100))
                | beam.io.WriteToText('/path/to/another/output'))

  46. 46
    WHAT ABOUT STREAMING?

  47. SIMPLE STREAMING PIPELINE
    with beam.Pipeline() as p:
        lines = p | beam.io.ReadFromPubSub(...) | beam.WindowInto(...)
        words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
        totals = words | Count()
        totals | beam.io.WriteToText('/path/to/output')
        (totals | beam.CombinePerKey(Largest(100))
                | beam.io.WriteToText('/path/to/another/output'))

  48. Demo

  49. WORD COUNT
    http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg

  50. TRENDING ON TWITTER
    http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png

  51. Portability
    &
    Vision
    Google Cloud
    Dataflow

  52. WHAT DOES APACHE BEAM PROVIDE?
    Runners for Existing Distributed Processing Backends
    The Beam Model: What / Where / When / How
    API (SDKs) for writing Beam pipelines
    Apache Apex
    Apache Flink
    InProcess / Local
    Apache Spark
    Google Cloud Dataflow
    Apache GearPump

  53. Other
    Languages
    Beam
    Java
    Beam
    Python Pipeline SDK
    User-facing SDK; it defines a language-specific
    API for the end user to specify the pipeline
    computation DAG.

  54. Runner API
    Other
    Languages
    Beam
    Java
    Beam
    Python Runner API
    A runner- and language-agnostic representation
    of the user's pipeline graph. It contains only
    Beam model primitives that all runners
    understand, preserving portability across
    runners.

  55. Runner API
    Other
    Languages
    Beam
    Java
    Beam
    Python
    Execution Execution
    Execution
    SDK Harness
    Docker-based execution environments, shared by
    all runners, that run the user code in a
    consistent environment.

  56. Fn API
    Runner API
    Other
    Languages
    Beam
    Java
    Beam
    Python
    Execution Execution
    Execution
    Fn API
    The API that the execution environments use to
    send and receive data and to report metrics
    about user-code execution to the Runner.

  57. Fn API
    Apache
    Flink
    Apache
    Spark
    Runner API
    Other
    Languages
    Beam
    Java
    Beam
    Python
    Execution Execution
    Cloud
    Dataflow
    Execution
    Apache
    Gear-
    pump
    Apache
    Apex
    Runner
    Distributed processing environments
    that understand the runner API
    graph and how to execute the Beam
    model primitives.

  58. BEAM RUNNER CAPABILITIES
    https://beam.apache.org/capability-matrix/

  59. HOW CAN YOU HELP?

  60. Runner 3
    The Beam Model
    Language A
    SDK
    Language C
    SDK
    Runner 2
    Runner 1
    Software Development
    Kits (SDKs)
    Language B
    SDK
    Have a programming language you
    want to see in Beam? Write an SDK.

  61. Runner 3
    The Beam Model
    Language A
    SDK
    Language C
    SDK
    Runner 2
    Runner 1
    Language B
    SDK
    Runners
    Google Cloud
    Dataflow
    Have an execution engine you want to
    see in Beam? Write a runner.

  62. The Beam Model
    Language A
    SDK
    Language C
    SDK
    Language B
    SDK
    Domain-specific
    extensions (DSLs)
    Have a target audience you want to see
    using Beam? Write a DSL.
    DSL 3
    DSL 2
    DSL 1

  63. The Beam Model
    Language A
    SDK
    Language C
    SDK
    Language B
    SDK
    Have shared components that can be
    part of larger pipelines? Write a library.
    Library 3
    Library 2
    Library 1
    Transform Libraries

  64. The Beam Model
    Language A
    SDK
    Language C
    SDK
    Language B
    SDK
    Have a data storage or messaging
    system? Write an IO connector.
    IO
    Connector
    IO
    Connector
    IO
    Connector
    IO Connectors

  65. MORE BEAM?
    Issue tracker (https://issues.apache.org/jira/projects/BEAM)
    Beam website (https://beam.apache.org/)
    Source code (https://github.com/apache/beam)
    Developers mailing list ([email protected])
    Users mailing list ([email protected])
    Follow @ApacheBeam on Twitter

  66. SUMMARY
    ● Beam helps you tackle big data that is:
    ○ Unbounded in volume
    ○ Out of order
    ○ Arbitrarily delayed
    ● The Beam model separates concerns of:
    ○ What is being computed?
    ○ Where in event time?
    ○ When in processing time?
    ○ How do refinements happen?

  67. Thanks!
    You can find me at: @sb2nov
    Questions?
