Apache Beam

This presentation gives an overview of the Apache Beam project. It shows that Beam is a means of developing generic data pipelines in multiple languages using the provided SDKs. The pipelines execute on a range of supported runners/executors.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

Mike Frampton

May 21, 2020

Transcript

  1. What Is Apache Beam?
    • A unified programming model
    • To define and execute data processing pipelines
    • For ETL, batch and stream processing
    • Open source / Apache 2.0 license
    • Written in Java, Python, Go
    • Cross platform support
    • Pipelines are defined using Beam SDKs
  2. How Does Beam Work?
    • Use provided SDKs to define pipelines
    • In Java, Python, Go
    • Beam SDK isolated in Docker container
    • Can be run by any of the execution runners
    • A supported group of runners execute the pipeline
    • Capability matrix defines
      – Relative capabilities of runners
      – See beam.apache.org for matrix
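The split between defining a pipeline and running it can be illustrated without the Beam SDK. The following is a minimal plain-Python sketch, not Beam code: `map_step`, `filter_step`, `eager_runner` and `streaming_runner` are hypothetical names invented for this illustration. It shows one pipeline definition handed to two different execution strategies, both producing the same result.

```python
from typing import Callable, Iterable, List

# A pipeline definition is just an ordered list of steps.
# Nothing runs until a runner is handed the definition.
Step = Callable[[Iterable], Iterable]


def map_step(fn: Callable) -> Step:
    """Build a step that applies fn to every element."""
    return lambda elements: (fn(e) for e in elements)


def filter_step(pred: Callable) -> Step:
    """Build a step that keeps elements for which pred is true."""
    return lambda elements: (e for e in elements if pred(e))


def eager_runner(steps: List[Step], source: Iterable) -> list:
    """Hypothetical runner: materialises a full list after each step."""
    data = list(source)
    for step in steps:
        data = list(step(data))
    return data


def streaming_runner(steps: List[Step], source: Iterable) -> list:
    """Hypothetical runner: chains lazy generators, one element at a time."""
    stream = iter(source)
    for step in steps:
        stream = step(stream)
    return list(stream)


# One definition, two execution strategies.
pipeline = [map_step(str.strip), filter_step(bool), map_step(str.upper)]
lines = [" beam ", "", "flink", "  spark"]

eager_result = eager_runner(pipeline, lines)
streaming_result = streaming_runner(pipeline, lines)
print(eager_result == streaming_result)
```

In Beam itself the runner is chosen when the pipeline is launched, and the capability matrix indicates which features each runner supports; this sketch only models the idea that the definition does not change.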
  3. Beam Programming Guide
    • A guide for users to create data pipelines
    • Examples in Java, Python, Go
    • Can design, create and test pipelines
    • Provides multi-language functions for
      – PCollections
      – Transforms
      – Pipeline I/O
      – Schemas
      – Data encoding / type safety
      – Windowing
      – Triggers
      – Metrics
      – State and Timers
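Of the topics listed above, windowing is perhaps the easiest to show in a few lines. Windowing subdivides a collection according to the timestamps of its elements. The sketch below is plain Python rather than the Beam SDK, and the function names are hypothetical; it assumes integer timestamps in seconds and fixed, non-overlapping windows represented as half-open intervals `[start, end)`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Window = Tuple[int, int]  # (start, end), end is exclusive


def fixed_window_start(timestamp: int, size: int) -> int:
    """Return the start of the fixed window that contains timestamp."""
    return timestamp - (timestamp % size)


def group_by_fixed_window(
    events: Iterable[Tuple[int, str]], size: int
) -> Dict[Window, List[str]]:
    """Assign each (timestamp, value) event to its window and group the values."""
    windows: Dict[Window, List[str]] = defaultdict(list)
    for timestamp, value in events:
        start = fixed_window_start(timestamp, size)
        windows[(start, start + size)].append(value)
    return dict(windows)


# Four events grouped into 60-second windows.
events = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
windows = group_by_fixed_window(events, size=60)
for window, values in sorted(windows.items()):
    print(window, values)
```

An element with timestamp 60 falls into the second window, not the first, because each window excludes its end boundary.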
  4. Beam Pipelines
    • When designing pipelines consider
      – Where data is stored
      – What does the data look like
      – What do you want to do with the data
      – What does your output data look like
      – Where should the data go
    • Use PCollection and PTransform functions to define pipelines
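The design questions above map naturally onto a read, transform, write shape. The sketch below models that shape in plain Python and does not use the Beam SDK: `Collection`, `parse_csv` and `sum_per_key` are hypothetical stand-ins for a PCollection and two PTransforms, and the `user,amount` input format is an assumption made for the example. The `|` operator is used because Beam's Python SDK applies transforms to collections in the same way.

```python
from collections import defaultdict
from typing import Callable, Iterable


class Collection:
    """Hypothetical stand-in for a PCollection: an immutable bag of elements."""

    def __init__(self, elements: Iterable):
        self._elements = tuple(elements)

    def __or__(self, transform: Callable[["Collection"], "Collection"]) -> "Collection":
        # Applying a transform never mutates the input; it returns a new collection.
        return transform(self)

    def __iter__(self):
        return iter(self._elements)


def parse_csv(coll: Collection) -> Collection:
    """Transform: 'user,amount' text lines become (user, amount) tuples."""
    rows = []
    for line in coll:
        user, amount = line.split(",")
        rows.append((user, float(amount)))
    return Collection(rows)


def sum_per_key(coll: Collection) -> Collection:
    """Transform: total the amounts for each user, sorted by user."""
    totals = defaultdict(float)
    for key, value in coll:
        totals[key] += value
    return Collection(sorted(totals.items()))


# Where is the data stored? Here an in-memory list stands in for a file or topic.
# What does it look like? One 'user,amount' text line per record.
raw = Collection(["ann,10.0", "bob,5.5", "ann,2.5"])

# What should be done with it? Parse each line, then total the amounts per user.
totals = raw | parse_csv | sum_per_key

# What does the output look like, and where does it go? (user, total) tuples in a list.
output = list(totals)
print(output)
```

Keeping each transform as a function from one collection to a new collection means the input is never modified, so the same input can feed several branches of a pipeline.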
  5. Beam Runners
    • Supported Beam Runners are
      – Direct Runner (test and development)
      – Apache Apex
      – Apache Flink
      – Apache Gearpump
      – Apache Hadoop MapReduce
      – Apache Nemo
      – Apache Samza
      – Apache Spark
      – Google Cloud Dataflow
      – Hazelcast Jet
      – IBM Streams
      – JStorm
  6. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  7. Connect
    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration