Slides from a talk given at Strata+Hadoop World New York, 30 September 2015. http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/42723
Even the best data scientists can't do anything if they cannot easily access the data they need. Simply making the data available is the first step towards becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.
Apache Kafka is a popular open source message broker for high-throughput real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn, where it reliably handles around a trillion messages per day.
What is less widely known: Kafka is also well suited for extracting data from existing databases, and making it available for analysis or for building data products. Unlike slow batch-oriented ETL, Kafka can make database data available to consumers in real-time, while also allowing efficient archiving to HDFS, for use in Spark, Hadoop or data warehouses.
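The key idea behind turning a database into a real-time feed is a changelog: every insert, update, and delete becomes an event keyed by the row's primary key, and a log-compacted Kafka topic retains at least the latest event per key, so a consumer can reconstruct the current state of the table. Here is a minimal, hedged sketch of that compaction logic in plain Python (this is an illustration of the idea, not Kafka's actual implementation; the keys and rows are made up):

```python
def compact(changelog):
    """Keep only the latest value per key, as Kafka log compaction would."""
    latest = {}
    for key, value in changelog:
        if value is None:
            latest.pop(key, None)   # a null value acts as a delete (tombstone)
        else:
            latest[key] = value
    return latest

# A stream of (primary_key, row) change events captured from a database:
changelog = [
    ("user:1", {"name": "Alice"}),
    ("user:2", {"name": "Bob"}),
    ("user:1", {"name": "Alice Smith"}),  # update overwrites the earlier event
    ("user:2", None),                     # delete is published as a tombstone
]

print(compact(changelog))  # {'user:1': {'name': 'Alice Smith'}}
```

A consumer that replays such a compacted topic from the beginning ends up with the same state as the source table, without ever running a batch export.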
When data science and product teams can process operational data in real-time, and combine it with user activity logs or sensor data, that turns out to be a potent mixture. Having all the data centrally available in a stream data platform is an exciting enabler for data-driven innovation.
In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is useful:
* Examples of the kinds of problems you can solve with Kafka
* Extracting real-time data feeds from databases, and sending them to Kafka
* Using Avro for schema management and future-proofing your data
* Designing your data pipelines to be resilient, but also flexible and amenable to change
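The schema-management point deserves a concrete picture. Avro future-proofs your data by resolving a writer's (older) schema against a reader's (newer) schema: a new field can be added as long as it has a default value, so old records remain readable. The following is a hedged, simplified sketch of that resolution rule in plain Python, without the Avro library; the field names and the default are invented for illustration:

```python
OLD_SCHEMA = ["id", "name"]                      # writer's schema (v1)
NEW_SCHEMA = {"id": None, "name": None,          # reader's schema (v2);
              "email": "unknown@example.com"}    # new field must carry a default

def resolve(record, writer_fields, reader_schema):
    """Project a record written with the old schema onto the new schema."""
    out = {}
    for field, default in reader_schema.items():
        # Take the value if the writer had the field, else fall back to default
        out[field] = record[field] if field in writer_fields else default
    return out

old_record = {"id": 1, "name": "Alice"}          # encoded with v1
print(resolve(old_record, OLD_SCHEMA, NEW_SCHEMA))
# {'id': 1, 'name': 'Alice', 'email': 'unknown@example.com'}
```

Because every consumer applies the same resolution rules, producers and consumers can upgrade their schemas independently, which is what makes the pipelines flexible and amenable to change.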
Example of a web server access log entry (user-agent string truncated):
220.127.116.11 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko)
1. Jay Kreps: "Putting Apache Kafka to use: A practical guide to building a stream data platform (part 1)," 25 February 2015. http://blog.confluent.io/2015/02/25/stream-data-platform-1/
2. Gwen Shapira: "The problem of managing schemas," 4 November 2014. http://
3. Martin Kleppmann: "Schema evolution in Avro, Protocol Buffers and Thrift," 5 December
4. Martin Kleppmann: "Bottled Water: Real-time integration of PostgreSQL and Kafka," 23 April 2015. http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-
5. Martin Kleppmann: "Designing data-intensive applications." O'Reilly Media, to appear. http://
6. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: "All Aboard the Databus!," at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-
Discount code: TS2015
50% off ebooks