Turning the database inside out with Apache Samza

Martin Kleppmann

September 18, 2014

Slides of a talk given on 18 September 2014 at Strange Loop, St Louis, MO.

Video: https://www.youtube.com/watch?v=fU9hR3kiOK0&list=PLeKd45zvjcDHJxge6VtYUAbYnvd_VNQCx

Abstract:

Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts. You can query it at some point in time, but that’s still old, imperative-style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real time.
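To make the idea concrete, here is a minimal sketch in plain Python (not Samza code). The names `Fact`, `apply_fact`, and `materialize`, and the shape of a fact, are hypothetical and chosen only for illustration. It shows state derived by folding a pure function over an append-only sequence of immutable facts, with an optional offset for querying a past point in time.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Fact:
    """An immutable fact: `key` was set to `value` at log position `offset`."""
    offset: int
    key: str
    value: int


def apply_fact(state: Dict[str, int], fact: Fact) -> Dict[str, int]:
    """Pure function: build a new state instead of mutating the old one."""
    return {**state, fact.key: fact.value}


def materialize(log: Sequence[Fact], up_to_offset: Optional[int] = None) -> Dict[str, int]:
    """Fold the log into a view; stop at `up_to_offset` to see a past state."""
    facts = (f for f in log if up_to_offset is None or f.offset <= up_to_offset)
    return reduce(apply_fact, facts, {})


# The log only ever grows; nothing in it is updated in place.
log = (
    Fact(0, "alice", 1),
    Fact(1, "bob", 5),
    Fact(2, "alice", 3),
)

current = materialize(log)                  # state after all facts
as_of_1 = materialize(log, up_to_offset=1)  # state as of offset 1
```

Because `apply_fact` never mutates its input, the same log can be replayed at any time to rebuild the view or to derive a different one.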

This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out.

At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.
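The following is a rough sketch of the general pattern in plain Python, not the Kafka or Samza API. The `Log` class, the function `join_clicks_with_profiles`, and the example streams are all assumptions made up for illustration. It models a log as an append-only list addressed by offset, and joins one stream against local state built from another. As a simplification, it reads the whole profile log before the click log, whereas a real stream processor would consume both incrementally.

```python
from typing import Any, Dict, Iterator, List, Tuple


class Log:
    """Append-only in-memory log; a toy stand-in for one partition of a commit log."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Any]] = []

    def append(self, key: str, value: Any) -> int:
        """Add an entry at the end and return its offset."""
        self._entries.append((key, value))
        return len(self._entries) - 1

    def read_from(self, offset: int) -> Iterator[Tuple[int, str, Any]]:
        """Yield (offset, key, value) for every entry from `offset` onwards."""
        for i in range(offset, len(self._entries)):
            key, value = self._entries[i]
            yield i, key, value


def join_clicks_with_profiles(profiles: Log, clicks: Log) -> List[Dict[str, Any]]:
    """Build local state from one log, then enrich each entry of the other with it."""
    profile_by_user: Dict[str, Any] = {}
    for _, user, profile in profiles.read_from(0):
        profile_by_user[user] = profile  # the latest profile per user wins

    joined = []
    for offset, user, url in clicks.read_from(0):
        joined.append({
            "offset": offset,
            "user": user,
            "url": url,
            "profile": profile_by_user.get(user),  # None if no profile is known
        })
    return joined


profiles = Log()
profiles.append("alice", {"country": "UK"})
profiles.append("alice", {"country": "DE"})

clicks = Log()
clicks.append("alice", "/home")
clicks.append("bob", "/about")

result = join_clicks_with_profiles(profiles, clicks)
```

Since consumers address the log by offset, a consumer that loses its local state can rebuild it by reading again from offset 0.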

What do we have to gain from turning the database inside out? Simpler code, better scalability, better robustness, lower latency, and more flexibility for doing interesting things with data. After this talk, you’ll see the architecture of your own applications in a completely new light.

Transcript

References / further reading

• Martin Kleppmann: “Rethinking caching in web apps.” 1 October 2012. http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html
• Martin Kleppmann: “Designing data-intensive applications.” O’Reilly, to appear in 2015. http://dataintensive.net/
• Jay Kreps: “The Log: What every software engineer should know about real-time data's unifying abstraction.” 16 December 2013. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
• Jay Kreps: “Questioning the Lambda Architecture.” 2 July 2014. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
• Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html
• Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/
• Apache Samza documentation. http://samza.incubator.apache.org/
• Alexandros Labrinidis, Qiong Luo, Jie Xu, and Wenwei Xue: “Caching and Materialization for Web Databases,” Foundations and Trends in Databases, volume 2, number 3, pages 169–266, March 2010.
• Stefano Ceri, Georg Gottlob, and Letizia Tanca: “What You Always Wanted to Know About Datalog (And Never Dared to Ask),” IEEE Transactions on Knowledge and Data Engineering, volume 1, number 1, pages 146–166, March 1989.