Turning the database inside out with Apache Samza

Slides of a talk given on 18 September 2014 at Strange Loop, St Louis, MO.

Video: https://www.youtube.com/watch?v=fU9hR3kiOK0&list=PLeKd45zvjcDHJxge6VtYUAbYnvd_VNQCx


Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts. You can query it at some point in time, but that’s still old, imperative-style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real time.
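To make the contrast concrete, here is a minimal sketch of the two styles, assuming a toy log of account deposits and withdrawals. The `Fact` type, its fields, and both function names are hypothetical and chosen only for illustration; they are not taken from the talk or from any library.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)  # frozen: a fact cannot be modified once recorded
class Fact:
    offset: int   # position in the log (hypothetical field)
    account: str
    amount: int   # positive = deposit, negative = withdrawal


def balances_as_of(log: List[Fact], offset: int) -> Dict[str, int]:
    """Point-in-time query: fold every fact up to and including `offset`.

    Assumes the log is ordered by offset.
    """
    balances: Dict[str, int] = {}
    for fact in log:
        if fact.offset > offset:
            break
        balances[fact.account] = balances.get(fact.account, 0) + fact.amount
    return balances


def balance_stream(facts: Iterable[Fact]) -> Iterator[Dict[str, int]]:
    """Streaming style: yield an updated snapshot as each fact arrives."""
    balances: Dict[str, int] = {}
    for fact in facts:
        balances[fact.account] = balances.get(fact.account, 0) + fact.amount
        yield dict(balances)  # copy, so earlier snapshots stay unchanged


log = [Fact(0, "alice", 50), Fact(1, "bob", 20), Fact(2, "alice", -30)]
```

Both functions derive state from the same immutable facts. The first recomputes it on request, while the second keeps a derived view up to date as each fact arrives.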

This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out.

At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.
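As a rough illustration of that layering, the sketch below uses an in-memory append-only log and a consumer that joins click events against the latest profile for each user, held in local state. It is a simplified stand-in under stated assumptions: `CommitLog`, `enrich_clicks`, and the message shapes are hypothetical, and none of this is the Kafka or Samza API.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Message = Tuple[str, str, str]  # (kind, user_id, payload); hypothetical shape


class CommitLog:
    """In-memory stand-in for an append-only log with offsets."""

    def __init__(self) -> None:
        self._entries: List[Message] = []

    def append(self, message: Message) -> int:
        """Append a message and return its offset; entries are never changed."""
        self._entries.append(message)
        return len(self._entries) - 1

    def read_from(self, offset: int) -> Iterator[Tuple[int, Message]]:
        """Yield (offset, message) pairs starting at `offset`."""
        for i in range(offset, len(self._entries)):
            yield i, self._entries[i]


def enrich_clicks(log: CommitLog) -> List[Tuple[int, str, Optional[str], str]]:
    """Join clicks with the most recent profile seen so far for that user."""
    profiles: Dict[str, str] = {}  # local state, rebuilt by replaying the log
    enriched = []
    for offset, (kind, user_id, payload) in log.read_from(0):
        if kind == "profile":
            profiles[user_id] = payload
        elif kind == "click":
            enriched.append((offset, user_id, profiles.get(user_id), payload))
    return enriched


log = CommitLog()
log.append(("profile", "u1", "Alice"))
log.append(("click", "u1", "/home"))
log.append(("profile", "u1", "Alicia"))
log.append(("click", "u1", "/about"))
log.append(("click", "u2", "/pricing"))
```

Because the log is ordered and never modified, replaying it from offset 0 rebuilds the same local state and produces the same joined output. In this sketch, that is what lets a consumer recover after losing its in-memory state.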

What do we have to gain from turning the database inside out? Simpler code, better scalability, better robustness, lower latency, and more flexibility for doing interesting things with data. After this talk, you’ll see the architecture of your own applications in a completely new light.

Martin Kleppmann

September 18, 2014



  61. κ


  82. References / further reading

    •  Martin Kleppmann: “Rethinking caching in web apps.” 1 October 2012. http://

    •  Martin Kleppmann: “Designing data-intensive applications.” O’Reilly, to appear in 2015. http://

    •  Jay Kreps: “The Log: What every software engineer should know about real-time data’s unifying abstraction.” 16 December 2013. http://engineering.linkedin.com/distributed-systems/log-what-

    •  Jay Kreps: “Questioning the Lambda Architecture.” 2 July 2014. http://radar.oreilly.com/2014/07/

    •  Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://

    •  Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/

    •  Apache Samza documentation. http://samza.incubator.apache.org/

    •  Alexandros Labrinidis, Qiong Luo, Jie Xu, and Wenwei Xue: “Caching and Materialization for Web Databases,” Foundations and Trends in Databases, volume 2, number 3, pages 169–266, March 2010.

    •  Stefano Ceri, Georg Gottlob, and Letizia Tanca: “What You Always Wanted to Know About Datalog (And Never Dared to Ask),” IEEE Transactions on Knowledge and Data Engineering, volume 1, number 1, pages 146–166, March 1989.
