Turning the database inside out with Apache Samza

Slides of a talk given on 18 September 2014 at Strange Loop, St Louis, MO.

Video: https://www.youtube.com/watch?v=fU9hR3kiOK0&list=PLeKd45zvjcDHJxge6VtYUAbYnvd_VNQCx


Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts. You can query it at some point in time, but that’s still old, imperative-style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real time.
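As a rough illustration of the "immutable facts" idea, here is a minimal sketch in plain Python. It is not Samza code; the `Fact` type, the shopping-cart domain, and the function names are all hypothetical, chosen only to show current state being derived by folding a pure function over a sequence of facts.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, FrozenSet, Iterable

# Hypothetical fact type: one immutable record of something that happened.
@dataclass(frozen=True)
class Fact:
    user: str
    item: str
    added: bool  # True = item added to the cart, False = item removed


View = Dict[str, FrozenSet[str]]


def apply_fact(view: View, fact: Fact) -> View:
    """Return a new view with one fact applied; the input view is not mutated."""
    items = set(view.get(fact.user, frozenset()))
    if fact.added:
        items.add(fact.item)
    else:
        items.discard(fact.item)
    return {**view, fact.user: frozenset(items)}


def materialize(facts: Iterable[Fact]) -> View:
    """Derive the current state by folding apply_fact over the facts in order."""
    return reduce(apply_fact, facts, {})


# The log only ever grows; a removal is a new fact, not an update in place.
log = (
    Fact("alice", "book", True),
    Fact("alice", "pen", True),
    Fact("alice", "book", False),
    Fact("bob", "mug", True),
)

view = materialize(log)
earlier_view = materialize(log[:2])  # state as of an earlier point in the log
```

In this sketch, "querying at some point in time" amounts to replaying a prefix of the log, while processing facts as they arrive amounts to calling `apply_fact` once per new fact.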

This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out.

At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.
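To make the log-plus-join idea concrete, the following toy sketch models an append-only log with offsets and a join between a stream of events and a table rebuilt from a changelog. It is an in-memory stand-in under stated assumptions, not the Kafka or Samza API: `CommitLog`, `join_clicks_with_profiles`, and the click/profile data are hypothetical names invented for this example, and nothing here is durable or distributed.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple


class CommitLog:
    """Toy append-only log: records are never modified, only appended."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, Any]] = []

    def append(self, key: str, value: Any) -> int:
        """Append a record and return its offset (position in the log)."""
        self._records.append((key, value))
        return len(self._records) - 1

    def read_from(self, offset: int) -> Iterator[Tuple[int, str, Any]]:
        """Yield (offset, key, value) for every record at or after offset."""
        for i in range(offset, len(self._records)):
            key, value = self._records[i]
            yield i, key, value


def join_clicks_with_profiles(
    profiles: CommitLog, clicks: CommitLog
) -> List[Dict[str, Optional[str]]]:
    """Enrich each click with the latest known profile for that user."""
    # Rebuild local state from the changelog: the last write per key wins.
    local_state: Dict[str, Dict[str, str]] = {}
    for _, user, profile in profiles.read_from(0):
        local_state[user] = profile

    # Join each click against local state instead of querying a remote database.
    enriched = []
    for _, user, url in clicks.read_from(0):
        country = local_state.get(user, {}).get("country")
        enriched.append({"user": user, "url": url, "country": country})
    return enriched


profiles = CommitLog()
first_offset = profiles.append("alice", {"country": "UK"})
second_offset = profiles.append("alice", {"country": "DE"})  # a later fact supersedes the first

clicks = CommitLog()
clicks.append("alice", "/home")
clicks.append("bob", "/about")

enriched = join_clicks_with_profiles(profiles, clicks)
```

One simplification to note: the sketch reads the whole profile log before any clicks, whereas a real stream processor would consume both inputs continuously and keep the local state up to date as new records arrive.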

What do we have to gain from turning the database inside out? Simpler code, better scalability, better robustness, lower latency, and more flexibility for doing interesting things with data. After this talk, you’ll see the architecture of your own applications in a completely new light.

Martin Kleppmann

September 18, 2014

  82. References / further reading

    •  Martin Kleppmann: “Rethinking caching in web apps.” 1 October 2012. http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html
    •  Martin Kleppmann: “Designing data-intensive applications.” O’Reilly, to appear in 2015. http://dataintensive.net/
    •  Jay Kreps: “The Log: What every software engineer should know about real-time data's unifying abstraction.” 16 December 2013. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
    •  Jay Kreps: “Questioning the Lambda Architecture.” 2 July 2014. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
    •  Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html
    •  Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/
    •  Apache Samza documentation. http://samza.incubator.apache.org/
    •  Alexandros Labrinidis, Qiong Luo, Jie Xu, and Wenwei Xue: “Caching and Materialization for Web Databases,” Foundations and Trends in Databases, volume 2, number 3, pages 169–266, March 2010.
    •  Stefano Ceri, Georg Gottlob, and Letizia Tanca: “What You Always Wanted to Know About Datalog (And Never Dared to Ask),” IEEE Transactions on Knowledge and Data Engineering, volume 1, number 1, pages 146–166, March 1989.