Apache Beam

This presentation gives an overview of the Apache Beam project. It shows how Beam provides a means of developing generic data pipelines in multiple languages using the provided SDKs, and how those pipelines execute on a range of supported runners/executors.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

Mike Frampton

May 21, 2020

Transcript

  1. What Is Apache Beam?
     • A unified programming model
     • To define and execute data processing pipelines
     • For ETL, batch and stream processing
     • Open source / Apache 2.0 license
     • Written in Java, Python and Go
     • Cross-platform support
     • Pipelines are defined using the Beam SDKs
  2. How Does Beam Work?
     • Use the provided SDKs to define pipelines
       – In Java, Python or Go
     • The Beam SDK is isolated in a Docker container
     • So pipelines can be run by any of the execution runners
     • A supported group of runners executes the pipeline
     • A capability matrix defines the relative capabilities of the runners
       – See beam.apache.org for the matrix
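
To make the SDK-defines / runner-executes split above concrete, here is a minimal sketch of a pipeline written with the Beam Python SDK. It is not taken from the deck; the element values and step labels are arbitrary, and with no runner specified Beam falls back to the Direct Runner.

    import apache_beam as beam

    # The pipeline is defined with the SDK; which runner executes it is
    # chosen separately (the Direct Runner is used when none is given).
    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Create' >> beam.Create(['batch', 'and', 'stream'])
         | 'Upper' >> beam.Map(str.upper)
         | 'Print' >> beam.Map(print))
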
  3. Beam Programming Guide
     • A guide for users creating data pipelines
     • Examples in Java, Python and Go
     • Shows how to design, create and test pipelines
     • Provides multi-language functions for
       – PCollections
       – Transforms
       – Pipeline I/O
       – Schemas
       – Data encoding / type safety
       – Windowing
       – Triggers
       – Metrics
       – State and Timers
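
As a rough illustration of a few of the guide's concepts (PCollections, an element-wise ParDo transform and a built-in combiner) in the Python SDK; the DoFn and sample text below are made up for this sketch, not taken from the deck.

    import apache_beam as beam

    class SplitWords(beam.DoFn):
        """A ParDo transform: emits one element per word of each input line."""
        def process(self, element):
            for word in element.split():
                yield word

    with beam.Pipeline() as pipeline:
        (pipeline
         | beam.Create(['to define and execute', 'data processing pipelines'])  # a PCollection
         | beam.ParDo(SplitWords())              # element-wise transform
         | beam.combiners.Count.PerElement()     # built-in aggregation
         | beam.Map(print))
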
  4. Beam Pipelines
     • When designing pipelines consider
       – Where the data is stored
       – What the data looks like
       – What you want to do with the data
       – What the output data should look like
       – Where the output data should go
     • Use PCollection and PTransform functions to define pipelines
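
A hedged sketch of how those design questions map onto a read–transform–write pipeline in the Python SDK; the file names 'input.txt' and 'output' are placeholders, not from the deck.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Where is the data' >> beam.io.ReadFromText('input.txt')
         | 'What to do with it' >> beam.Filter(lambda line: line.strip() != '')
         | 'Where should it go' >> beam.io.WriteToText('output', file_name_suffix='.txt'))
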
  5. Beam Runners
     • Supported Beam runners are
       – Direct Runner (test and development)
       – Apache Apex
       – Apache Flink
       – Apache Gearpump
       – Apache Hadoop MapReduce
       – Apache Nemo
       – Apache Samza
       – Apache Spark
       – Google Cloud Dataflow
       – Hazelcast Jet
       – IBM Streams
       – JStorm
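
To show how a runner from the list above is selected without changing the pipeline code, here is a sketch using the Python SDK's PipelineOptions. The Direct Runner is used here as an assumption for local testing; switching to, say, FlinkRunner or DataflowRunner would require that runner's own additional options, which the deck does not cover.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner is a pipeline option, not part of the pipeline definition.
    options = PipelineOptions(['--runner=DirectRunner'])

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * x)
         | beam.Map(print))
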
  6. Available Books
     • See “Big Data Made Easy” – Apress, Jan 2015
     • See “Mastering Apache Spark” – Packt, Oct 2015
     • See “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018
     • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
     • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  7. Connect
     • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
     • See my open source blog at open-source-systems.blogspot.com/
     • I am always interested in
       – New technology
       – Opportunities
       – Technology-based issues
       – Big data integration