
Data liberation and data integration with Kafka


Slides from a talk given at Strata+Hadoop World New York, 30 September 2015. http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/42723


Even the best data scientist can't do anything if they cannot easily get access to the necessary data. Simply making the data available is step 1 towards becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.

Apache Kafka is a popular open source message broker for high-throughput real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn, where it reliably handles around a trillion messages per day.

What is less widely known: Kafka is also well suited for extracting data from existing databases and making it available for analysis or for building data products. Unlike slow batch-oriented ETL, Kafka can make database data available to consumers in real time, while also allowing efficient archiving to HDFS for use in Spark, Hadoop or data warehouses.

When data science and product teams can process operational data in real time and combine it with user activity logs or sensor data, that turns out to be a potent mixture. Having all the data centrally available in a stream data platform is an exciting enabler for data-driven innovation.

In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is useful:

* Examples of the kinds of problems you can solve with Kafka
* Extracting real-time data feeds from databases, and sending them to Kafka
* Using Avro for schema management and future-proofing your data
* Designing your data pipelines to be resilient, but also flexible and amenable to change
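
To make the schema-management point concrete, here is a minimal sketch of the idea behind Avro-style schema resolution: a reader with a newer schema can still read records written with an older one, provided every added field declares a default. This is a simplified pure-Python illustration, not the Avro library; the `PageView` record, its fields, and the `resolve` helper are hypothetical names chosen for this example.

```python
from typing import Any

# Hypothetical schemas, written as plain dicts in the shape of Avro record schemas.
WRITER_SCHEMA_V1 = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
}

# Version 2 adds an optional field with a default, so old data stays readable.
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record: dict[str, Any], writer: dict, reader: dict) -> dict[str, Any]:
    """Project a record written with `writer` onto the `reader` schema.

    Simplified rules (assumed for this sketch): fields known to both schemas
    are copied, fields only the reader knows take their default, and fields
    only the writer knows are dropped.
    """
    writer_fields = {f["name"] for f in writer["fields"]}
    resolved = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved
```

Calling `resolve({"user_id": 42, "url": "/"}, WRITER_SCHEMA_V1, READER_SCHEMA_V2)` fills in `referrer` with its default, while swapping the two schemas drops the field the older reader does not know about. A real deployment would rely on an Avro implementation and typically a schema registry rather than a hand-written helper like this.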

Martin Kleppmann

September 30, 2015


  22. - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
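
A log line like the one on slide 22 is a single unstructured string. As a rough sketch of what turning it into a structured event might look like before publishing it to a stream, the following parses the common "combined" access-log layout with a regular expression. The field names, the `parse_log_line` helper, and the host address in `SAMPLE` are assumptions for illustration (the slide's line does not show a client host, so a documentation-range address stands in for it).

```python
import re
from datetime import datetime

# Combined log format: host ident user [time] "request" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line: str) -> dict:
    """Turn one access-log line into a dict with typed fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line is not in combined log format")
    event = match.groupdict()
    # Convert the string fields that have an obvious richer type.
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    event["size"] = None if event["size"] == "-" else int(event["size"])
    return event


# Hypothetical host 203.0.113.7 prepended; the rest follows the slide's line.
SAMPLE = (
    '203.0.113.7 - - [27/Feb/2015:17:55:11 +0000] '
    '"GET /css/typography.css HTTP/1.1" 200 3377 '
    '"http://martin.kleppmann.com/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"'
)
```

`parse_log_line(SAMPLE)` yields a dict whose `status`, `size` and `time` values are an integer, an integer and a timezone-aware `datetime`, which is the kind of record that could then be encoded against a schema rather than passed around as free text.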


  71. References

    1.  Jay Kreps: “Putting Apache Kafka to use: A practical guide to building a stream data platform (part 1),” 25 February 2015. http://blog.confluent.io/2015/02/25/stream-data-platform-1/

    2.  Gwen Shapira: “The problem of managing schemas,” 4 November 2014. http://

    3.  Martin Kleppmann: “Schema evolution in Avro, Protocol Buffers and Thrift,” 5 December 2012. http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-

    4.  Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka,” 23 April 2015. http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-

    5.  Martin Kleppmann: “Designing data-intensive applications,” O’Reilly Media, to appear. http://

    6.  Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-


  72. Office hours:

    5.25pm today

    O’Reilly Booth

    Expo Hall


  73. Discount code: TS2015

    50% off ebooks
