Scalable stream processing with Apache Kafka and Apache Samza

I have given this talk (with minor variations) at the following venues:

• ApacheCon EU, Budapest, Hungary, 18 November 2014. http://apacheconeu2014.sched.org/event/3633e195715f88c3357749d57b7b3b8c
• Unified Log London Meetup, London, UK, 2 December 2014. http://www.meetup.com/unified-log-london/events/218025352/
• Jfokus, Stockholm, Sweden, 4 February 2015. https://martin.kleppmann.com/2015/02/04/samza-at-jfokus.html

Abstract:

Samza, an Apache Incubator project, is a framework for processing and analysing high-volume data streams. It is built upon Apache Kafka and YARN (Hadoop 2.0). You can think of Samza as a real-time, continuously running version of MapReduce.

In this talk, Martin will show why stream processing is becoming an important part of the architecture of data-intensive applications, alongside storage and batch processing. We will explore how Samza works, and show how it reliably processes millions of messages per second. We will also examine what kinds of applications would benefit from using Samza.
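To make the "continuously running version of MapReduce" analogy concrete, here is a minimal sketch in plain Python. It does not use the Samza API; the function names are hypothetical, and the event fields (`eventType`, `viewedProfileId`) are borrowed from the PageViewEvent shown in the slides. The map step turns each event into a key-value pair, and the reduce step is applied incrementally, so an updated result is emitted after every input message instead of once at the end of a batch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_event(event: Dict) -> Iterator[Tuple[int, int]]:
    """Map step: emit (viewedProfileId, 1) for each page view event."""
    if event.get("eventType") == "PageViewEvent":
        yield event["viewedProfileId"], 1


def running_counts(events: Iterable[Dict]) -> Iterator[Tuple[int, int]]:
    """Reduce step, applied incrementally.

    Yields the updated view count for a profile after every page view,
    so `events` may be an unbounded iterator.
    """
    counts: Dict[int, int] = defaultdict(int)
    for event in events:
        for key, value in map_event(event):
            counts[key] += value
            yield key, counts[key]
```

A batch job would return the final counts once the input is exhausted; the streaming variant above never needs the input to end, which is the property the analogy points at.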

Martin Kleppmann

November 18, 2014

Transcript

  14. { eventType: PageViewEvent, timestamp: 1413215518, viewerId: 1234, sessionId: fa1afe101234deadbeef, pageKey: profile-view, viewedProfileId: 4321, trackingKey: invitation-email, …metadata about displayed content… }
  32. key = urn:linkedin:profile:1234 value = { eventType: ProfileEditEvent, timestamp: 1413215518, profile: { location: “Cambridge, UK”, industry: “Software”, positions: [ {job_title: “Author”, company: “O’Reilly”}, … ]}}
  56. References (fun stuff to read)
    1. Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net
    2. Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html
    3. Jay Kreps: “I ♥︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do
    4. Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/
    5. Jakob Homan: “Real time insights into LinkedIn's performance using Apache Samza.” 18 Aug 2014. http://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
    6. Martin Kleppmann: “Moving faster with data streams: The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream-processing/moving-faster-data-streams-rise-samza-linkedin
    7. Praveen Neppalli Naga: “Real-time Analytics at Massive Scale with Pinot.” 29 Sept 2014. http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
    8. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf
    9. Apache Samza documentation. http://samza.incubator.apache.org
    10. Alan Woodward and Martin Kleppmann: “Samza-Luwak Proof of Concept.” 10 November 2014. https://github.com/romseygeek/samza-luwak
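The transcript shows two message shapes: an unkeyed PageViewEvent and a ProfileEditEvent keyed by `urn:linkedin:profile:1234`. The extracted text does not say how the two streams are combined, so the following is only a sketch of one plausible use, in plain Python rather than the Samza API. It assumes that a later message for a key supersedes earlier ones, and that a page view's `viewerId` maps onto the profile key by appending it to the `urn:linkedin:profile:` prefix. Both assumptions, and the function names, are mine and are not taken from the slides.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Key prefix as it appears in the keyed message in the transcript.
PROFILE_KEY_PREFIX = "urn:linkedin:profile:"


def latest_by_key(messages: Iterable[Tuple[str, Dict]]) -> Dict[str, Dict]:
    """Fold a keyed stream into a table holding the newest value per key."""
    table: Dict[str, Dict] = {}
    for key, value in messages:
        # Assumption: a later message for the same key replaces the earlier one.
        table[key] = value
    return table


def enrich_page_views(
    page_views: Iterable[Dict], profiles: Dict[str, Dict]
) -> Iterator[Dict]:
    """Attach the viewer's current profile (if known) to each page view."""
    for view in page_views:
        # Assumption: viewerId corresponds to the numeric suffix of the key.
        key = f"{PROFILE_KEY_PREFIX}{view['viewerId']}"
        edit = profiles.get(key)
        yield {**view, "viewerProfile": edit["profile"] if edit else None}
```

For example, feeding two edits for `urn:linkedin:profile:1234` into `latest_by_key` leaves a single table entry containing the second edit, and `enrich_page_views` then decorates every page view from viewer 1234 with that profile. Views from unknown viewers get `viewerProfile` set to `None`.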