Apache Beam

This presentation gives an overview of the Apache Beam project. It shows how Beam provides a means of developing generic data pipelines in multiple languages using the provided SDKs, and how those pipelines execute on a range of supported runners/executors.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

Mike Frampton

May 21, 2020

Transcript

  1. What Is Apache Beam?
     • A unified programming model
     • To define and execute data processing pipelines
     • For ETL, batch and stream processing
     • Open source / Apache 2.0 license
     • Written in Java, Python and Go
     • Cross-platform support
     • Pipelines are defined using the Beam SDKs
  2. How Does Beam Work?
     • Use the provided SDKs to define pipelines
       – In Java, Python or Go
     • The Beam SDK is isolated in a Docker container
     • So pipelines can be run by any of the execution runners
     • A supported group of runners executes the pipeline
     • A capability matrix defines the relative capabilities of the runners
       – See beam.apache.org for the matrix
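
To make the SDK-defines / runner-executes split above concrete, here is a minimal sketch of a pipeline written with the Beam Python SDK. It is not taken from the deck; the element values and step labels are arbitrary, and with no runner specified Beam falls back to the Direct Runner.

    import apache_beam as beam

    # The pipeline is defined with the SDK; which runner executes it is
    # chosen separately (the Direct Runner is used when none is given).
    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Create' >> beam.Create(['batch', 'and', 'stream'])
         | 'Upper' >> beam.Map(str.upper)
         | 'Print' >> beam.Map(print))
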
  3. Beam Programming Guide
     • A guide for users creating data pipelines
     • Examples in Java, Python and Go
     • Shows how to design, create and test pipelines
     • Provides multi-language functions for
       – PCollections
       – Transforms
       – Pipeline I/O
       – Schemas
       – Data encoding / type safety
       – Windowing
       – Triggers
       – Metrics
       – State and Timers
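
As a rough illustration of a few of the guide's concepts (PCollections, an element-wise ParDo transform and a built-in combiner) in the Python SDK; the DoFn and sample text below are made up for this sketch, not taken from the deck.

    import apache_beam as beam

    class SplitWords(beam.DoFn):
        """A ParDo transform: emits one element per word of each input line."""
        def process(self, element):
            for word in element.split():
                yield word

    with beam.Pipeline() as pipeline:
        (pipeline
         | beam.Create(['to define and execute', 'data processing pipelines'])  # a PCollection
         | beam.ParDo(SplitWords())              # element-wise transform
         | beam.combiners.Count.PerElement()     # built-in aggregation
         | beam.Map(print))
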
  4. Beam Pipelines
     • When designing pipelines consider
       – Where the data is stored
       – What the data looks like
       – What you want to do with the data
       – What the output data should look like
       – Where the output data should go
     • Use PCollection and PTransform functions to define pipelines
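
A hedged sketch of how those design questions map onto a read–transform–write pipeline in the Python SDK; the file names 'input.txt' and 'output' are placeholders, not from the deck.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Where is the data' >> beam.io.ReadFromText('input.txt')
         | 'What to do with it' >> beam.Filter(lambda line: line.strip() != '')
         | 'Where should it go' >> beam.io.WriteToText('output', file_name_suffix='.txt'))
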
  5. Beam Runners
     • Supported Beam runners are
       – Direct Runner (test and development)
       – Apache Apex
       – Apache Flink
       – Apache Gearpump
       – Apache Hadoop MapReduce
       – Apache Nemo
       – Apache Samza
       – Apache Spark
       – Google Cloud Dataflow
       – Hazelcast Jet
       – IBM Streams
       – JStorm
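
To show how a runner from the list above is selected without changing the pipeline code, here is a sketch using the Python SDK's PipelineOptions. The Direct Runner is used here as an assumption for local testing; switching to, say, FlinkRunner or DataflowRunner would require that runner's own additional options, which the deck does not cover.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner is a pipeline option, not part of the pipeline definition.
    options = PipelineOptions(['--runner=DirectRunner'])

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * x)
         | beam.Map(print))
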
  6. Available Books
     • See “Big Data Made Easy” – Apress, Jan 2015
     • See “Mastering Apache Spark” – Packt, Oct 2015
     • See “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018
     • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
     • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  7. Connect
     • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
     • See my open source blog at open-source-systems.blogspot.com/
     • I am always interested in
       – New technology
       – Opportunities
       – Technology-based issues
       – Big data integration