Slides from a talk given at Strata+Hadoop World New York, 30 September 2015. http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/42723
Even the best data scientists can't do anything if they can't easily get access to the data they need. Simply making the data available is step one towards becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and for integration with existing systems.
Apache Kafka is a popular open source message broker for high-throughput real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn, where it reliably handles around a trillion messages per day.
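As a concrete illustration, publishing an activity event to Kafka takes only a few lines of Java using Kafka's producer API. This is a minimal sketch: the broker address, the topic name ("page-views"), and the JSON message format are placeholder assumptions, not details prescribed by the talk.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PageViewProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one user-activity event. Keying by user ID sends all of a
                // user's events to the same partition, so they stay in order.
                producer.send(new ProducerRecord<>("page-views", "user-42",
                        "{\"page\": \"/pricing\", \"ts\": 1443614400}"));
            }
        }
    }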
What is less widely known is that Kafka is also well suited to extracting data from existing databases and making it available for analysis or for building data products. Unlike slow, batch-oriented ETL, Kafka can make database data available to consumers in real time, while also allowing efficient archiving to HDFS for use in Spark, Hadoop, or data warehouses.
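On the consuming side, a downstream application could tail such a database feed using Kafka's consumer API. Again a rough sketch: the topic name ("db.users") and the convention that each message describes one row change, keyed by primary key, are assumptions made for illustration.

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ChangeFeedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "analytics");               // consumer group for this app
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Hypothetical topic carrying one message per database row change.
                consumer.subscribe(Arrays.asList("db.users"));
                while (true) { // runs until the process is stopped
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        // key = primary key of the changed row; value = its new state
                        System.out.printf("key=%s value=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }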
When data science and product teams can process operational data in real time and combine it with user activity logs or sensor data, the result is a potent mixture. Having all of the data centrally available in a stream data platform is an exciting enabler for data-driven innovation.
In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is useful:
* Examples of the kinds of problems you can solve with Kafka
* Extracting real-time data feeds from databases, and sending them to Kafka
* Using Avro for schema management and future-proofing your data (a schema sketch follows this list)
* Designing your data pipelines to be resilient, but also flexible and amenable to change
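To make the Avro point concrete, here is what a hypothetical schema for a page-view event might look like (Avro schemas are defined in JSON; the field names here are invented for the example). The "referrer" field sketches schema evolution: because it declares a default value, readers using this newer schema can still decode records written before the field existed, which is part of what keeps a pipeline flexible and amenable to change.

    {
      "type": "record",
      "name": "PageView",
      "namespace": "example.events",
      "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long",
         "doc": "Event timestamp, milliseconds since the epoch"},
        {"name": "referrer", "type": ["null", "string"], "default": null,
         "doc": "Added later; the null default keeps old and new schemas compatible"}
      ]
    }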