Building real-time data products at LinkedIn with Apache Samza

Presented at Strata+Hadoop World, New York, 16 October 2014 http://strataconf.com/stratany2014/public/schedule/detail/36045

Video: https://www.youtube.com/watch?v=yO3SBU6vVKA&list=PLeKd45zvjcDHJxge6VtYUAbYnvd_VNQCx


The world is going real-time. MapReduce, SQL-on-Hadoop and similar batch processing tools are fine for analyzing and processing data after the fact — but sometimes you need to process data continuously as it comes in, and react to it within a few seconds or less. How do you do that at Hadoop scale?

Apache Samza is an open source stream processing framework designed to solve these kinds of problems. It is built upon YARN/Hadoop 2.0 and Apache Kafka. You can think of Samza as a real-time, continuously running version of MapReduce.

Samza has some unique features that make it powerful. It provides high performance for stateful processing jobs, including aggregation and joins between many input streams. It is designed to support an ecosystem of many different jobs written by different teams, and it isolates them from each other, so that one badly behaved job can’t affect the others.

At LinkedIn, we have been using Samza in production for some time, both for internal analytics purposes and for data products that are served on the live site. In this talk, we’ll discuss our experience of working with Samza. You’ll learn about:

- What kinds of real-time data problems you can solve with Samza
- How Samza reliably scales to millions of messages per second
- How Samza compares to other stream processing frameworks
- How Samza can help collaboration between different data science, product, and engineering teams within an organization
- How to avoid implementing the same data pipeline twice (once for offline/batch processing and once for real-time/stream processing)
- Lessons we learnt on how to structure real-time data pipelines for scale and flexibility

Martin Kleppmann

October 16, 2014


  11. {
        eventType: PageViewEvent,
        timestamp: 1413215518,
        viewerId: 1234,
        sessionId: fa1afe101234deadbeef,
        pageKey: profile-view,
        viewedProfileId: 4321,
        trackingKey: invitation-email,
        ... etc. metadata about what content was displayed ...
      }
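
    The stateful aggregation mentioned in the abstract can be illustrated with events shaped like the PageViewEvent on this slide. The following is a minimal sketch in plain Python, not Samza code: the function name `count_profile_views`, the in-memory `Counter` standing in for local state, and the sample events are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def count_profile_views(events: Iterable[dict]) -> Iterator[Tuple[int, int]]:
    """Consume events one at a time and emit (profile_id, running_count).

    Hypothetical helper: the Counter plays the role of the job's local
    state. A real stream processor would also persist this state so it
    survives restarts, which this sketch does not attempt.
    """
    counts: Counter = Counter()
    for event in events:
        # Only profile page views contribute to the aggregate.
        if event.get("eventType") != "PageViewEvent":
            continue
        if event.get("pageKey") != "profile-view":
            continue
        profile_id = event["viewedProfileId"]
        counts[profile_id] += 1
        yield profile_id, counts[profile_id]


# Made-up sample input using the field names from the slide.
sample_events = [
    {"eventType": "PageViewEvent", "pageKey": "profile-view",
     "viewerId": 1234, "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "pageKey": "home",
     "viewerId": 1234},
    {"eventType": "PageViewEvent", "pageKey": "profile-view",
     "viewerId": 5678, "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "pageKey": "profile-view",
     "viewerId": 1234, "viewedProfileId": 99},
]

updates = list(count_profile_views(sample_events))
```

    Each incoming event updates the count immediately, so the result is always current instead of being recomputed by a periodic batch job.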

  20. {
        eventType: PageViewEvent,
        timestamp: 1413215518,
        viewerId: 1234,
        sessionId: fa1afe101234deadbeef,
        pageKey: profile-view,
        viewedProfileId: 4321,
        trackingKey: invitation-email,
        ... etc. metadata about what content was displayed ...
      }

  28. {
        eventType: ProfileEditEvent,
        timestamp: 1413215518,
        profileId: 1234,
        old: {
          location: "San Francisco, CA",
          industry: "Internet"},
        new: {
          location: "New York, NY",
          industry: "Financial Services"}
      }
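
    A join between the two event types in this deck could look roughly like the sketch below. It is plain Python, not Samza code, and it assumes both event types arrive interleaved in one time-ordered sequence. The function name `enrich_page_views`, the `viewedProfile` output field, and the sample values are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def enrich_page_views(events: Iterable[dict]) -> Iterator[dict]:
    """Join page views against the latest known profile fields.

    Hypothetical helper: ProfileEditEvents update an in-memory table
    keyed by profileId, and each PageViewEvent is emitted with a copy
    of whatever is currently known about the viewed profile.
    """
    profiles: Dict[int, dict] = {}
    for event in events:
        kind = event.get("eventType")
        if kind == "ProfileEditEvent":
            # Apply the "new" values on top of what we already hold.
            profiles.setdefault(event["profileId"], {}).update(event["new"])
        elif kind == "PageViewEvent":
            profile = profiles.get(event["viewedProfileId"], {})
            yield {**event, "viewedProfile": dict(profile)}


# Made-up sample input using the field names from the slides.
sample_events = [
    {"eventType": "PageViewEvent", "viewerId": 1234, "viewedProfileId": 4321},
    {"eventType": "ProfileEditEvent", "profileId": 4321,
     "old": {"location": "San Francisco, CA", "industry": "Internet"},
     "new": {"location": "New York, NY", "industry": "Financial Services"}},
    {"eventType": "PageViewEvent", "viewerId": 5678, "viewedProfileId": 4321},
]

enriched = list(enrich_page_views(sample_events))
```

    A page view seen before any edit for that profile is emitted with an empty `viewedProfile`. Whether to emit, buffer, or drop such events is a design choice this sketch leaves open.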

  66. References (fun stuff to read)

    1.  Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear in 2015. http://dataintensive.net

    2.  Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-

    3.  Jay Kreps: “I ♥︎ Logs.” O'Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do

    4.  Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning MEAP, to appear January 2015. http://manning.com/marz/

    5.  Jakob Homan: “Real time insights into LinkedIn's performance using Apache Samza.” 18 Aug 2014. http://engineering.linkedin.com/samza/

    6.  Martin Kleppmann: “Moving faster with data streams: The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream-

    7.  Praveen Neppalli Naga: “Real-time Analytics at Massive Scale with Pinot.” 29 Sept 2014. http://engineering.linkedin.com/analytics/real-

    8.  David He: “Monitor and Improve Web Performance Using RUM Data Visualization.” 19 Sept 2014. http://engineering.linkedin.com/

    9.  Lili Wu, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse: “The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop on Recommender Systems and the Social Web, Oct 2014. http://ls13-www.cs.uni-dortmund.de/homepage/rsweb2014/papers/

    10.  Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf

    11.  Apache Samza documentation. http://samza.incubator.apache.org
