Event Sourcing and Stream Processing at Scale

Slides from a talk given at DDD Europe, Brussels, Belgium, 29 January 2016.


If an idea is good, different communities will independently come up with it, but give it different names. For example, the ideas of Event Sourcing and CQRS emerged from the DDD community, while similar ideas appeared under the title of Stream Processing in internet companies such as LinkedIn, Twitter and Google.

This talk attempts to bridge those communities, and works out the commonalities and differences between Event Sourcing and Stream Processing, so that we can all learn from each other.

We will discuss lessons learnt from applying event-based architectures at large scale (over 10 million messages per second) at LinkedIn, and how such systems are implemented using the open source distributed messaging projects Apache Kafka and Apache Samza. We'll also discuss some of the architectural choices that affect scalability, both in terms of data throughput and in terms of organisational scalability.

Martin Kleppmann

January 29, 2016

  11. {
      eventType: PageViewEvent,
      timestamp: 1413215518,
      viewerId: 1234,
      sessionId: 646cf6694c550a24,
      pageKey: profile-view,
      viewedProfileId: 4321,
      trackingKey: invitation-email,
      ... etc. metadata about what content was displayed...
    }
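
An activity event like the one on this slide is typically consumed by a stream processor that maintains derived state. The following is a minimal Python sketch of that idea, not code from the talk: the sample events, the `count_profile_views` function, and the choice to count views per `viewedProfileId` are assumptions made for illustration. In a real deployment the events would arrive from a log such as a Kafka topic rather than from an in-memory list.

```python
from collections import Counter
from typing import Iterable

# Hypothetical sample events, shaped like the PageViewEvent on the slide.
EVENTS = [
    {"eventType": "PageViewEvent", "timestamp": 1413215518, "viewerId": 1234,
     "pageKey": "profile-view", "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "timestamp": 1413215530, "viewerId": 5678,
     "pageKey": "profile-view", "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "timestamp": 1413215544, "viewerId": 1234,
     "pageKey": "profile-view", "viewedProfileId": 8765},
]


def count_profile_views(events: Iterable[dict]) -> Counter:
    """Consume events one at a time, keeping a running count per viewed profile."""
    counts: Counter = Counter()
    for event in events:
        is_profile_view = (
            event.get("eventType") == "PageViewEvent"
            and event.get("pageKey") == "profile-view"
        )
        if is_profile_view:
            counts[event["viewedProfileId"]] += 1
    return counts


views = count_profile_views(EVENTS)
print(views[4321], views[8765])
```

The events themselves are never modified; the counter is a derived view that can be discarded and recomputed from the event log at any time.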

  20. {
      eventType: PageViewEvent,
      timestamp: 1413215518,
      viewerId: 1234,
      sessionId: 646cf6694c550a24,
      pageKey: profile-view,
      viewedProfileId: 4321,
      trackingKey: invitation-email,
      ... etc. metadata about what content was displayed...
    }

  33. {
      eventType: ProfileEditEvent,
      timestamp: 1413215518,
      profileId: 1234,
      old: {
        location: "London, UK",
        industry: "Financial Services"
      },
      new: {
        location: "Brussels, Belgium",
        industry: "Software"
      }
    }
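
In an event-sourced design, the current state of a profile is not stored directly but derived by replaying its edit events in order. The sketch below illustrates this under stated assumptions: the function names `apply_profile_edit` and `rebuild_profile`, the first event in `LOG`, and the convention that the `new` field simply overwrites matching keys are all hypothetical and not taken from the talk.

```python
from typing import Iterable


def apply_profile_edit(profile: dict, event: dict) -> dict:
    """Return a new profile dict with the event's `new` fields applied."""
    if event.get("eventType") != "ProfileEditEvent":
        return profile
    updated = dict(profile)       # copy, so earlier states stay untouched
    updated.update(event["new"])
    return updated


def rebuild_profile(profile_id: int, events: Iterable[dict]) -> dict:
    """Fold over the log (assumed to be in commit order) to derive current state."""
    state: dict = {}
    for event in events:
        if event.get("profileId") == profile_id:
            state = apply_profile_edit(state, event)
    return state


# Hypothetical log: the first event is invented; the second mirrors the slide.
LOG = [
    {"eventType": "ProfileEditEvent", "timestamp": 1400000000, "profileId": 1234,
     "old": {},
     "new": {"location": "London, UK", "industry": "Financial Services"}},
    {"eventType": "ProfileEditEvent", "timestamp": 1413215518, "profileId": 1234,
     "old": {"location": "London, UK", "industry": "Financial Services"},
     "new": {"location": "Brussels, Belgium", "industry": "Software"}},
]

current = rebuild_profile(1234, LOG)
print(current)
```

Replaying only a prefix of the log yields the profile as it was at that point in time, which is one practical benefit of keeping the events rather than only the latest state.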

  46. Kafka at scale
    •  LinkedIn: 1.1 trillion (1.1 × 10^12) events per day;
       peak: 18 M events/sec, 3.8 GB/sec
    •  Netflix: 400 billion (4 × 10^11) events per day;
       peak: 8 M events/sec, 17 GB/sec
    •  Uber, Twitter, Yahoo, Spotify, etc.
    Sources:
    •  http://www.confluent.io/blog/apache-kafka-hits-1.1-trillion-messages-per-day-joins-the-4-comma-club (Sept 2015)
    •  https://engineering.linkedin.com/kafka/running-kafka-scale (March 2015)
    •  https://engineering.linkedin.com/blog/2016/01/whats-new-samza (January 2016)
    •  http://www.slideshare.net/wangxia5/netflix-kafka (March 2015)
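
The daily totals and peak rates on this slide can be related with some simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not something presented in the talk: it assumes 1 GB = 10^9 bytes and that the peak event rate and peak byte rate were measured at the same moment, which the slide does not state. The names `STATS` and `summarize` are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures as quoted on the slide; GB is taken as 10**9 bytes (an assumption).
STATS = {
    "LinkedIn": {"events_per_day": 1.1e12,
                 "peak_events_per_sec": 18e6,
                 "peak_bytes_per_sec": 3.8e9},
    "Netflix": {"events_per_day": 4e11,
                "peak_events_per_sec": 8e6,
                "peak_bytes_per_sec": 17e9},
}


def summarize(stats: dict) -> dict:
    """Derive average rate, peak-to-average ratio, and bytes per event at peak."""
    avg = stats["events_per_day"] / SECONDS_PER_DAY
    return {
        "avg_events_per_sec": avg,
        "peak_to_avg": stats["peak_events_per_sec"] / avg,
        "bytes_per_event_at_peak": (
            stats["peak_bytes_per_sec"] / stats["peak_events_per_sec"]
        ),
    }


for company, stats in STATS.items():
    summary = summarize(stats)
    print(company, {name: round(value, 2) for name, value in summary.items()})
```

Under these assumptions the average rates come out at roughly 12.7 million events/sec for LinkedIn and 4.6 million for Netflix, and the implied event size at peak is about ten times larger for Netflix (around 2 KB) than for LinkedIn (around 200 bytes).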

