Data liberation and data integration with Kafka

Slides from a talk given at Strata+Hadoop World New York, 30 September 2015. http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/42723


Even the best data scientist can't do anything without easy access to the necessary data. Simply making the data available is the first step towards becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.

Apache Kafka is a popular open source message broker for high-throughput real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn, where it reliably handles around a trillion messages per day.
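
As a rough illustration of the broker model described above, the following Python sketch treats a topic as an append-only list in which every consumer keeps its own read position (offset). `TopicLog` and `Consumer` are hypothetical names invented for this example and are not part of any Kafka client library; real Kafka adds partitioning, replication and durable storage, all of which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class TopicLog:
    """Toy in-memory stand-in for one partition of a topic (hypothetical class)."""
    messages: List[Any] = field(default_factory=list)

    def append(self, message: Any) -> int:
        """Append a message and return its offset (its position in the log)."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> List[Any]:
        """Return up to max_messages starting at offset; reading deletes nothing."""
        return self.messages[offset:offset + max_messages]


@dataclass
class Consumer:
    """Each consumer tracks its own offset, so consumers do not affect each other."""
    log: TopicLog
    offset: int = 0

    def poll(self, max_messages: int = 10) -> List[Any]:
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = TopicLog()
for path in ["/", "/about", "/css/typography.css"]:
    log.append({"event": "page_view", "path": path})

realtime = Consumer(log)   # e.g. a low-latency consumer reading small batches
archiver = Consumer(log)   # e.g. a batch job copying everything to long-term storage

first = realtime.poll(max_messages=2)
everything = archiver.poll()
rest = realtime.poll()
```

Because reads never remove messages, the slow archiver and the fast real-time consumer can share one log without coordinating.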

What is less widely known: Kafka is also well suited for extracting data from existing databases and making it available for analysis or for building data products. Unlike slow batch-oriented ETL, Kafka can make database data available to consumers in real time, while also allowing efficient archiving to HDFS for use in Spark, Hadoop or data warehouses.
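
To make the idea of turning database writes into a stream more concrete, here is a minimal sketch using SQLite from the Python standard library. It is an assumption-laden stand-in, not the method used in the talk: changes are captured with triggers into a hypothetical `changelog` table and polled, whereas production tools typically read the database's replication log instead. The `topic` list stands in for a Kafka topic, and the table, column and function names are invented for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

-- Changelog table: one row per change, ordered by an increasing sequence number.
CREATE TABLE changelog (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    email TEXT
);

CREATE TRIGGER users_insert AFTER INSERT ON users BEGIN
    INSERT INTO changelog (op, row_id, email) VALUES ('insert', NEW.id, NEW.email);
END;
CREATE TRIGGER users_update AFTER UPDATE ON users BEGIN
    INSERT INTO changelog (op, row_id, email) VALUES ('update', NEW.id, NEW.email);
END;
CREATE TRIGGER users_delete AFTER DELETE ON users BEGIN
    INSERT INTO changelog (op, row_id, email) VALUES ('delete', OLD.id, NULL);
END;
""")


def poll_changes(connection, last_seq):
    """Return (events, new_last_seq) for all changes with seq > last_seq, in seq order."""
    rows = connection.execute(
        "SELECT seq, op, row_id, email FROM changelog WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    events = [
        {"key": row_id, "value": json.dumps({"op": op, "email": email})}
        for (_seq, op, row_id, email) in rows
    ]
    new_last_seq = rows[-1][0] if rows else last_seq
    return events, new_last_seq


topic = []    # stand-in for a Kafka topic; a real pipeline would call a producer here
last_seq = 0  # position in the changelog that has already been published

conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
events, last_seq = poll_changes(conn, last_seq)
topic.extend(events)

conn.execute("DELETE FROM users WHERE id = 1")
events, last_seq = poll_changes(conn, last_seq)
topic.extend(events)
```

Keying each event by the row's primary key means a downstream consumer can rebuild the latest state of every row by replaying the stream in order.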

When data science and product teams can process operational data in real time and combine it with user activity logs or sensor data, the result is a potent mixture. Having all the data centrally available in a stream data platform is an exciting enabler for data-driven innovation.

In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is useful:

* Examples of the kinds of problems you can solve with Kafka
* Extracting real-time data feeds from databases, and sending them to Kafka
* Using Avro for schema management and future-proofing your data
* Designing your data pipelines to be resilient, but also flexible and amenable to change
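
The schema-management point in the list above can be illustrated with a small sketch. The two schemas below use Avro's JSON schema syntax, but the `resolve` function is a simplified pure-Python model of how a reader's schema is applied to data written with an older schema; it is not the Avro library's API, and the `PageView` record and its fields are hypothetical. Type promotion, aliases and nested records are deliberately left out.

```python
import json

# Schema the producer used when the data was written (version 1).
WRITER_SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "path", "type": "string"},
    {"name": "status", "type": "int"}
  ]
}
""")

# Schema a newer consumer expects: "status" was dropped, "referrer" was added
# with a default so that old data can still be read.
READER_SCHEMA_V2 = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "path", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema.

    Matching fields are copied, fields only the writer knows are dropped,
    and fields only the reader knows take their declared default.
    """
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for f in reader_schema["fields"]:
        name = f["name"]
        if name in writer_fields:
            resolved[name] = record[name]
        elif "default" in f:
            resolved[name] = f["default"]
        else:
            raise ValueError(f"reader field {name!r} has no default and was not written")
    return resolved


old_record = {"path": "/css/typography.css", "status": 200}
new_view = resolve(old_record, WRITER_SCHEMA_V1, READER_SCHEMA_V2)
```

Adding fields only with defaults is what lets producers and consumers upgrade at different times without breaking each other.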


Martin Kleppmann

September 30, 2015

  22. - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
  71. References

      1. Jay Kreps: “Putting Apache Kafka to use: A practical guide to building a stream data platform (part 1),” 25 February 2015. http://blog.confluent.io/2015/02/25/stream-data-platform-1/
      2. Gwen Shapira: “The problem of managing schemas,” 4 November 2014. http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html
      3. Martin Kleppmann: “Schema evolution in Avro, Protocol Buffers and Thrift,” 5 December 2012. http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
      4. Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka,” 23 April 2015. http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
      5. Martin Kleppmann: “Designing data-intensive applications,” O’Reilly Media, to appear. http://dataintensive.net
      6. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf
  72. Office hours: 5.25pm today, O’Reilly booth, Expo Hall
  73. Discount code: TS2015 (50% off ebooks)