Scalable stream processing with Apache Kafka and Apache Samza

I have given this talk (with minor variations) at the following venues:

• ApacheCon EU, Budapest, Hungary, 18 November 2014. http://apacheconeu2014.sched.org/event/3633e195715f88c3357749d57b7b3b8c
• Unified Log London Meetup, London, UK, 2 December 2014. http://www.meetup.com/unified-log-london/events/218025352/
• Jfokus, Stockholm, Sweden, 4 February 2015. https://martin.kleppmann.com/2015/02/04/samza-at-jfokus.html

Abstract:

Samza, an Apache Incubator project, is a framework for processing and analysing high-volume data streams. It is built upon Apache Kafka and YARN (Hadoop 2.0). You can think of Samza as a real-time, continuously running version of MapReduce.

In this talk, Martin will show why stream processing is becoming an important part of the architecture of data-intensive applications, alongside storage and batch processing. We will explore how Samza works, and show how it reliably processes millions of messages per second. We will also examine what kinds of applications would benefit from using Samza.
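
One way to picture the abstract's comparison with MapReduce: a batch job maps and reduces a finite input and then exits, whereas a stream job applies the same two steps to each message as it arrives and keeps a running result. The sketch below is plain Python and does not use Samza's API; the function names and the sample events are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_event(event: Dict[str, str]) -> Tuple[str, int]:
    """'Map' step: turn one event into a (key, value) pair."""
    return event["eventType"], 1


def running_counts(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, int]]:
    """'Reduce' step applied continuously: update the totals after every
    message and emit a snapshot, instead of waiting for the input to end."""
    totals: Counter = Counter()
    for event in events:  # with a real stream this loop would never finish
        key, value = map_event(event)
        totals[key] += value
        yield dict(totals)


# Hypothetical sample input standing in for an unbounded stream.
sample = [
    {"eventType": "PageViewEvent"},
    {"eventType": "ProfileEditEvent"},
    {"eventType": "PageViewEvent"},
]

snapshots = list(running_counts(sample))
print(snapshots[-1])
```

Each snapshot is available as soon as its message has been processed, which is the property that separates this from a batch job over the same data.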

Martin Kleppmann

November 18, 2014

Transcript

  14. {
        eventType: PageViewEvent,
        timestamp: 1413215518,
        viewerId: 1234,
        sessionId: fa1afe101234deadbeef,
        pageKey: profile-view,
        viewedProfileId: 4321,
        trackingKey: invitation-email,
        …metadata about displayed content…
      }
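
The slide shows the event in shorthand, without quotes. As a rough illustration, here is how the same fields might look as JSON and be read in Python. The quoting, the choice of JSON as the encoding, and the reading of `timestamp` as seconds since the Unix epoch are all assumptions; the elided metadata is left out.

```python
import json
from datetime import datetime, timezone

# The fields from the slide, written as JSON (an assumed encoding).
raw = """
{
  "eventType": "PageViewEvent",
  "timestamp": 1413215518,
  "viewerId": 1234,
  "sessionId": "fa1afe101234deadbeef",
  "pageKey": "profile-view",
  "viewedProfileId": 4321,
  "trackingKey": "invitation-email"
}
"""

event = json.loads(raw)

# Assumption: the timestamp counts seconds since the Unix epoch.
viewed_at = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)

print(event["viewerId"], "viewed profile", event["viewedProfileId"])
print("at", viewed_at.isoformat())
```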

  23. {
        eventType: PageViewEvent,
        timestamp: 1413215518,
        viewerId: 1234,
        sessionId: fa1afe101234deadbeef,
        pageKey: profile-view,
        viewedProfileId: 4321,
        trackingKey: invitation-email,
        …metadata about displayed content…
      }

  32. key = urn:linkedin:profile:1234
      value = {
        eventType: ProfileEditEvent,
        timestamp: 1413215518,
        profile: {
          location: “Cambridge, UK”,
          industry: “Software”,
          positions: [
            {job_title: “Author”, company: “O’Reilly”},
          ]}}
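
A message with a key, as on this slide, lends itself to two common patterns: routing every message for the same key to the same partition, and keeping only the most recent value per key. The following is a minimal sketch of both in plain Python. It does not use Kafka or Samza; `partition_for`, `apply_edit`, the partition count, the CRC32 hash, and the second sample edit are illustrative assumptions, and a real system would use its own hash function.

```python
import zlib
from typing import Any, Dict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically, so every edit for the
    same profile lands in the same partition. CRC32 is a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def apply_edit(state: Dict[str, Any], key: str, value: Dict[str, Any]) -> None:
    """Keep only the latest profile per key (last write wins)."""
    state[key] = value["profile"]


key = "urn:linkedin:profile:1234"
edits = [
    {"eventType": "ProfileEditEvent", "timestamp": 1413215518,
     "profile": {"location": "Cambridge, UK", "industry": "Software"}},
    # Hypothetical later edit, not taken from the slide.
    {"eventType": "ProfileEditEvent", "timestamp": 1413215600,
     "profile": {"location": "London, UK", "industry": "Software"}},
]

latest: Dict[str, Any] = {}
for edit in edits:
    apply_edit(latest, key, edit)

print(partition_for(key), latest[key]["location"])
```

Because the partition depends only on the key, a single consumer sees all edits for a given profile in order and can hold that profile's latest state locally.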

  46. key = urn:linkedin:profile:1234
      value = {
        eventType: ProfileEditEvent,
        timestamp: 1413215518,
        profile: {
          location: “Cambridge, UK”,
          industry: “Software”,
          positions: [
            {job_title: “Author”, company: “O’Reilly”},
          ]}}

  56. References (fun stuff to read)

    1.  Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net

    2.  Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html

    3.  Jay Kreps: “I ♥︎ Logs.” O’Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do

    4.  Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/

    5.  Jakob Homan: “Real time insights into LinkedIn’s performance using Apache Samza.” 18 Aug 2014. http://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza

    6.  Martin Kleppmann: “Moving faster with data streams: The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream-processing/moving-faster-data-streams-rise-samza-linkedin

    7.  Praveen Neppalli Naga: “Real-time Analytics at Massive Scale with Pinot.” 29 Sept 2014. http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

    8.  Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf

    9.  Apache Samza documentation. http://samza.incubator.apache.org

    10. Alan Woodward and Martin Kleppmann: “Samza-Luwak Proof of Concept.” 10 November 2014. https://github.com/romseygeek/samza-luwak
