Apache Samza

What Is Apache Samza?
• An asynchronous computational framework
• For distributed sub-second stream processing
• Fault tolerance, isolation and stateful processing
• Open source / Apache 2.0 license
• Developed in Java and Scala
• Runs stand-alone or on YARN

Samza Use Cases
• Applications that require millisecond to second response times
  – Streaming analytics
  – DDoS attack detection
  – Fraud detection
  – Metric anomaly detection
  – System notifications
  – Performance monitoring

Samza Users

Samza Partitioned Stream
• Samza uses streams to process data
• Streams are ordered collections of immutable objects
• Each object is a key-value pair
• Each stream is sharded into partitions
• This allows the architecture to scale
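The partitioning idea above can be sketched in a few lines of plain Python. This is a conceptual model only, not the Samza API: the partition count, the CRC32 hash and the `partition_for` / `append` helpers are all assumptions made for illustration. It shows that messages with the same key land in the same partition and keep their order there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, chosen for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(stream, key, value):
    """Append a key-value message to its partition; return (partition, offset)."""
    partition = partition_for(key)
    stream[partition].append((key, value))
    return partition, len(stream[partition]) - 1


# A stream modelled as partition id -> ordered list of (key, value) messages.
stream = defaultdict(list)
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
for key, value in events:
    append(stream, key, value)

# All messages for one key sit in one partition, in arrival order.
user1_partition = partition_for("user-1")
user1_events = [v for k, v in stream[user1_partition] if k == "user-1"]
print(user1_events)  # ['login', 'click', 'logout']
```

Because each key maps to exactly one partition, partitions can be processed in parallel without losing per-key ordering, which is what lets the architecture scale.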

Samza APIs
• High Level Streams API (Java) – Stream based processing API
• Low Level Task API (Java) – Message based processing API
• Table API – Random access by key to data sources
• Testing Samza – Samza's integration testing framework
• Samza SQL – Stream processing via SQL and UDFs
• Apache Beam – Samza provides a Beam runner for application execution

Samza Architecture

Samza Architecture
• Applications are broken down into tasks
• Each task consumes data from a stream partition
• Tasks are executed within containers
• A coordinator assigns tasks to containers
• Tasks checkpoint the offset of the last message they processed
• Each task has its own state store for state management
• Samza replicates changes to the local store in a separate stream
• This allows later recovery of local stores
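The checkpointing bullet above can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Samza code: the `Task` class, the shared `checkpoints` dict and the `fail_at` crash switch are hypothetical, and the sketch checkpoints after every message for simplicity, whereas a real system would typically checkpoint less often and may reprocess some messages after a restart.

```python
class Task:
    """Consumes one partition and records the last offset it processed."""

    def __init__(self, partition, checkpoints):
        self.partition = partition
        self.checkpoints = checkpoints  # partition id -> last processed offset
        self.processed = []

    def run(self, messages, fail_at=None):
        # Resume from the offset after the last checkpoint (or from 0).
        start = self.checkpoints.get(self.partition, -1) + 1
        for offset in range(start, len(messages)):
            if offset == fail_at:
                raise RuntimeError("simulated crash")
            self.processed.append(messages[offset])
            self.checkpoints[self.partition] = offset  # checkpoint the offset


messages = ["a", "b", "c", "d"]
checkpoints = {}  # stands in for durable checkpoint storage

first = Task(partition=0, checkpoints=checkpoints)
try:
    first.run(messages, fail_at=2)  # crashes before processing offset 2
except RuntimeError:
    pass

# A replacement task picks up where the failed one left off.
second = Task(partition=0, checkpoints=checkpoints)
second.run(messages)
print(first.processed, second.processed)  # ['a', 'b'] ['c', 'd']
```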

Samza Architecture
• Task container coordination
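As a rough illustration of a coordinator assigning tasks to containers, the sketch below creates one task per partition and spreads the tasks round-robin. The round-robin policy, the `assign_tasks` helper and the task and container names are assumptions for illustration; this is not claimed to be Samza's actual grouping strategy.

```python
def assign_tasks(num_partitions, container_ids):
    """Create one task per partition and spread the tasks round-robin."""
    assignment = {container: [] for container in container_ids}
    for partition in range(num_partitions):
        container = container_ids[partition % len(container_ids)]
        assignment[container].append(f"task-{partition}")
    return assignment


assignment = assign_tasks(5, ["container-0", "container-1"])
for container, tasks in assignment.items():
    print(container, tasks)
# container-0 ['task-0', 'task-2', 'task-4']
# container-1 ['task-1', 'task-3']
```

If a container is lost, calling `assign_tasks` again with the surviving container ids redistributes every task, which is the kind of decision a coordinator has to make.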

Samza Architecture
• Fault tolerance of state
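Fault tolerance of state rests on replicating every local store change to a separate stream and replaying that stream after a failure. The sketch below models this with a plain list standing in for the changelog stream; the `ChangelogBackedStore` class and its methods are hypothetical names, not part of Samza.

```python
class ChangelogBackedStore:
    """Local key-value store that mirrors every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list of (key, value); value None = delete
        self.local = {}

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.local.pop(key, None)
        self.changelog.append((key, None))

    @classmethod
    def restore(cls, changelog):
        """Rebuild a lost local store by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store


changelog = []  # stands in for the replicated changelog stream
store = ChangelogBackedStore(changelog)
store.put("count:user-1", 1)
store.put("count:user-1", 2)
store.put("count:user-2", 1)
store.delete("count:user-2")

# Simulate losing the container: only the changelog survives.
recovered = ChangelogBackedStore.restore(changelog)
print(recovered.local)  # {'count:user-1': 2}
```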

Samza Architecture
• Incremental checkpointing

Samza Architecture
• State management

Available Books
• See “Big Data Made Easy” – Apress Jan 2015
• See “Mastering Apache Spark” – Packt Oct 2015
• See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
• Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
• Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020

Connect
• Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
• See my open source blog at – open-source-systems.blogspot.com/
• I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration

Mike Frampton
