Big data processing with Apache Beam

In this talk, we present the new Python SDK for Apache Beam - a parallel programming model that allows one to implement batch and streaming data processing jobs that can run on a variety of execution engines like Apache Spark and Google Cloud Dataflow. We will use examples to discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing.

Sourabh

July 06, 2017

Transcript

  1. Data Processing with Apache Beam PyData Seattle 2017

  2. I am Sourabh Hello!

  3. I am Sourabh Hello! I am a Software Engineer

  4. I am Sourabh Hello! I am a Software Engineer I

    tweet at @sb2nov
  5. I am Sourabh Hello! I am a Software Engineer I

    tweet at @sb2nov I like Ice Cream
  6. What is Apache Beam?

  7. Apache Beam is a unified programming model for expressing efficient

    and portable data processing pipelines
  8. Big Data

  9. https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg LAUNCH!!

  10. DATA CAN BE BIG

  11. … REALLY BIG ... Tuesday Wednesday Thursday

  12. UNBOUNDED, DELAYED, OUT OF ORDER 9:00 8:00 14:00 13:00 12:00

    11:00 10:00 8:00 8:00 8:00
  13. ORGANIZING THE STREAM 8:00 8:00 8:00

  14. DATA PROCESSING TRADEOFFS Completeness Latency $$$ Cost

  15. WHAT IS IMPORTANT? Completeness Low Latency Low Cost Important Not

    Important $$$
  16. MONTHLY BILLING Completeness Low Latency Low Cost Important Not Important

    $$$
  17. BILLING ESTIMATE Completeness Low Latency Low Cost Important Not Important

    $$$
  18. FRAUD DETECTION Completeness Low Latency Low Cost Important Not Important

    $$$
  19. Beam Model

  20. Pipeline PTransform PCollection (bounded or unbounded)
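To make the three primitives on this slide concrete, here is a plain-Python analogy (not the Beam API, which is lazy and supports unbounded data): a PCollection is modeled as a list, a PTransform as a function from one collection to another, and a Pipeline as a fixed composition of transforms. All names below are illustrative.

```python
def pardo(fn, pcollection):
    """Apply fn to every element, letting it emit zero or more outputs
    (the core behavior of a Beam ParDo transform)."""
    return [out for elem in pcollection for out in fn(elem)]

def pipeline(source):
    # "Pipeline": a fixed DAG of transforms applied to a source collection.
    doubled = pardo(lambda x: [x, x], source)                     # transform 1
    evens = pardo(lambda x: [x] if x % 2 == 0 else [], doubled)   # transform 2
    return evens

print(pipeline([1, 2, 3]))  # [2, 2]
```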

  21. EVENT TIME VS PROCESSING TIME

  22. ASKING THE RIGHT QUESTIONS When in processing time? What is

    being computed? Where in event time? How do refinements happen?
  23. WHAT IS BEING COMPUTED? scores: PCollection[KV[str, int]] = (input |

    Sum.integersPerKey())
  24. WHAT IS BEING COMPUTED?

  25. WHERE IN EVENT TIME? scores: PCollection[KV[str, int]] = (input |

    beam.WindowInto(FixedWindows(2 * 60)) | Sum.integersPerKey())
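To see what `FixedWindows(2 * 60)` changes about the sum, here is a plain-Python sketch (illustrative only, not the Beam API), assuming each input element is a `(key, value, event_time_seconds)` triple: every element is assigned to the non-overlapping 2-minute window containing its event time, and sums are kept per (key, window) rather than per key.

```python
from collections import defaultdict

WINDOW = 2 * 60  # window size in seconds

def windowed_sums(events):
    """Sum values per (key, window-start) for fixed, non-overlapping windows."""
    totals = defaultdict(int)
    for key, value, ts in events:
        window_start = (ts // WINDOW) * WINDOW  # start of the window containing ts
        totals[(key, window_start)] += value
    return dict(totals)

events = [("julia", 3, 10), ("julia", 4, 70), ("nick", 2, 130)]
print(windowed_sums(events))
# {('julia', 0): 7, ('nick', 120): 2}
```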
  26. WHERE IN EVENT TIME?

  27. WHERE IN EVENT TIME?

  28. WHEN IN PROCESSING TIME? scores: PCollection[KV[str, int]] = (input |

    beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark())) | Sum.integersPerKey())
  29. WHEN IN PROCESSING TIME?

  30. HOW DO REFINEMENTS HAPPEN? scores: PCollection[KV[str, int]] = (input |

    beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(1 * 60)) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) | Sum.integersPerKey())
  31. HOW DO REFINEMENTS HAPPEN?
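The key choice on the previous slide is `accumulatingFiredPanes()`. A small sketch of the two pane semantics, assuming three trigger firings (early, on-time, late) for one window; with accumulation each pane reports the running total so far, while a discarding mode would report only the delta since the last firing. This is illustrative pure Python, not the Beam trigger machinery.

```python
def panes(firings, accumulating=True):
    """Simulate pane output for one window across several trigger firings."""
    emitted, total = [], 0
    for batch in firings:       # each firing sees a new batch of values
        total += sum(batch)
        emitted.append(total if accumulating else sum(batch))
    return emitted

firings = [[3], [4, 1], [2]]    # early, on-time, and late firing
print(panes(firings, accumulating=True))   # [3, 8, 10]
print(panes(firings, accumulating=False))  # [3, 5, 2]
```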

  32. CUSTOMIZING WHAT WHERE WHEN HOW Classic Batch Windowed Batch Streaming

    Streaming + Accumulation For more information see https://cloud.google.com/dataflow/examples/gaming-example
  33. Examples

  34. WORD COUNT http://www.levraphael.com/blog/wp-content/uploads/2015/06/word-pile.jpg

  35. WORD COUNT import apache_beam as beam, re

  36. WORD COUNT import apache_beam as beam, re with beam.Pipeline() as

    p: (p | beam.io.textio.ReadFromText("input.txt"))
  37. WORD COUNT import apache_beam as beam, re with beam.Pipeline() as

    p: (p | beam.io.textio.ReadFromText("input.txt") | beam.FlatMap(lambda s: re.split("\\W+", s)))
  38. WORD COUNT import apache_beam as beam, re with beam.Pipeline() as

    p: (p | beam.io.textio.ReadFromText("input.txt") | beam.FlatMap(lambda s: re.split("\\W+", s)) | beam.combiners.Count.PerElement())
  39. WORD COUNT import apache_beam as beam, re with beam.Pipeline() as

    p: (p | beam.io.textio.ReadFromText("input.txt") | beam.FlatMap(lambda s: re.split("\\W+", s)) | beam.combiners.Count.PerElement() | beam.Map(lambda wc: "%s: %d" % wc))
  40. WORD COUNT import apache_beam as beam, re with beam.Pipeline() as

    p: (p | beam.io.textio.ReadFromText("input.txt") | beam.FlatMap(lambda s: re.split("\\W+", s)) | beam.combiners.Count.PerElement() | beam.Map(lambda wc: "%s: %d" % wc) | beam.io.textio.WriteToText("output/stringcounts"))
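What the finished pipeline computes can be checked with a plain-Python equivalent, assuming a small in-memory input in place of `input.txt`: split lines into words (the `FlatMap` step), count occurrences (`Count.PerElement`), then format each pair (`Map`).

```python
import re
from collections import Counter

lines = ["the cat sat", "the cat"]

# FlatMap: each line emits many words; drop empty strings from the split.
words = [w for line in lines for w in re.split(r"\W+", line) if w]

# Count.PerElement: occurrences of each distinct word.
counts = Counter(words)

# Map: format each (word, count) pair; sorted here for a stable result.
formatted = sorted("%s: %d" % (w, c) for w, c in counts.items())
print(formatted)  # ['cat: 2', 'sat: 1', 'the: 2']
```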
  41. TRENDING ON TWITTER http://thegetsmartblog.com/wp-content/uploads/2013/06/Twitter-trends-feature.png

  42. TRENDING ON TWITTER with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic"))

  43. TRENDING ON TWITTER with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic")

    | beam.WindowInto(SlidingWindows(5*60, 1*60)))
  44. TRENDING ON TWITTER with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic")

    | beam.WindowInto(SlidingWindows(5*60, 1*60)) | beam.ParDo(ParseHashTagDoFn()))
  45. TRENDING ON TWITTER with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic")

    | beam.WindowInto(SlidingWindows(5*60, 1*60)) | beam.ParDo(ParseHashTagDoFn()) | beam.combiners.Count.PerElement())
  46. TRENDING ON TWITTER with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic")

    | beam.WindowInto(SlidingWindows(5*60, 1*60)) | beam.ParDo(ParseHashTagDoFn()) | beam.combiners.Count.PerElement() | beam.ParDo(BigQueryOutputFormatDoFn()) | beam.io.WriteToBigQuery("trends_table"))
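The interesting step here is `SlidingWindows(5*60, 1*60)`. Unlike fixed windows, every element falls into up to five overlapping 5-minute windows, one starting each minute, so a hashtag's count in any window reflects the trailing five minutes. A plain-Python sketch, assuming each tweet is a `(hashtag, event_time_seconds)` pair (illustrative only, and it ignores boundary handling at time zero):

```python
from collections import defaultdict

SIZE, PERIOD = 5 * 60, 60  # window size and slide period, in seconds

def sliding_counts(tweets):
    """Count each hashtag once per overlapping window that contains it."""
    counts = defaultdict(int)
    for tag, ts in tweets:
        start = (ts // PERIOD) * PERIOD   # latest window start <= ts
        while start > ts - SIZE:          # every window that still contains ts
            counts[(tag, start)] += 1
            start -= PERIOD
    return dict(counts)

windows = sliding_counts([("#pydata", 725)])
print(sorted(start for (_, start) in windows))
# [480, 540, 600, 660, 720]  -- five overlapping windows contain this tweet
```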
  47. Portability & Vision Google Cloud Dataflow

  48. Other Languages Beam Java Beam Python Pipeline SDK User facing

    SDK, defines a language specific API for the end user to specify the pipeline computation DAG.
  49. Runner API Other Languages Beam Java Beam Python Runner API

    Runner and language agnostic representation of the user’s pipeline graph. It only contains nodes of Beam model primitives that all runners understand to maintain portability across runners.
  50. Runner API Other Languages Beam Java Beam Python Execution Execution

    Execution SDK Harness Docker based execution environments that are shared by all runners for running the user code in a consistent environment.
  51. Fn API Runner API Other Languages Beam Java Beam Python

    Execution Execution Execution Fn API API which the execution environments use to send and receive data, report metrics around execution of the user code with the Runner.
  52. Fn API Apache Flink Apache Spark Runner API Other Languages

    Beam Java Beam Python Execution Execution Cloud Dataflow Execution Apache Gear- pump Apache Apex Runner Distributed processing environments that understand the runner API graph and how to execute the Beam model primitives.
  53. More Beam? Issue tracker (https://issues.apache.org/jira/projects/BEAM) Beam website (https://beam.apache.org/) Source code

    (https://github.com/apache/beam) Developers mailing list (dev-subscribe@beam.apache.org) Users mailing list (user-subscribe@beam.apache.org)
  54. Thanks! You can find me at: @sb2nov Questions?