
Patterns for real-time stream processing

Slides from my talk at Crunch Conference, Budapest, Hungary, 30 October 2015.

Abstract:

You have some streams of data, such as user activity on a website, or sensor readings from devices. Now you want to process the data and make it useful with low latency: for example, generating real-time recommendations, detecting abuse, filtering spam or predicting demand. And you want it to scale well.

Perhaps you’ve heard of distributed stream processing frameworks such as Samza, Storm or Spark Streaming, which may do what you want, but you’re not sure how to use them most effectively.

This talk will introduce some common design patterns for working with high-volume, real-time data streams. We will look at things like joining, enriching, filtering and aggregating streaming data, and we’ll explore how you might break down an application into streaming operators that do what you want.
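To make two of these patterns concrete, here is a minimal sketch of a filtering operator feeding an aggregating operator, written as plain Python generators rather than against any particular framework's API. The field names `eventType`, `timestamp` and `viewedProfileId` follow the page-view event shown in the slides; the function names, the tumbling-window approach and the assumption that events arrive in timestamp order are illustrative choices, not taken from the talk.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Tuple

Event = Dict[str, Any]


def filter_events(events: Iterable[Event], event_type: str) -> Iterator[Event]:
    """Filtering operator: pass through only events of the given type."""
    for event in events:
        if event.get("eventType") == event_type:
            yield event


def count_per_window(
    events: Iterable[Event], window_seconds: int, key_field: str
) -> Iterator[Tuple[int, Dict[Any, int]]]:
    """Aggregating operator: count events per key in tumbling windows.

    Assumes (for this sketch only) that events arrive in timestamp order,
    so a window can be emitted as soon as an event from a later window shows up.
    """
    current_window = None
    counts: Counter = Counter()
    for event in events:
        # Round the timestamp down to the start of its window.
        window = event["timestamp"] // window_seconds * window_seconds
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window
        counts[event[key_field]] += 1
    # Flush the last, still-open window when the input ends.
    if current_window is not None:
        yield current_window, dict(counts)


# Hypothetical sample data for illustration.
sample_events = [
    {"eventType": "PageViewEvent", "timestamp": 0, "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "timestamp": 30, "viewedProfileId": 4321},
    {"eventType": "ProfileEditEvent", "timestamp": 40, "profileId": 1234},
    {"eventType": "PageViewEvent", "timestamp": 70, "viewedProfileId": 99},
]

page_views = filter_events(sample_events, "PageViewEvent")
views_per_minute = list(count_per_window(page_views, 60, "viewedProfileId"))
```

Because each operator consumes one stream and produces another, they can be chained or rearranged freely. A real deployment would also have to deal with out-of-order events and with an input that never ends, which this sketch deliberately ignores.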

References:

1. Apache Samza documentation. http://samza.apache.org

2. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing,” Proceedings of the VLDB Endowment, volume 8, number 12, pages 1792–1803, August 2015. http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

3. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012. http://www.socc2012.org/s18-das.pdf

4. Nathan Marz and James Warren: “Big Data: Principles and best practices of scalable realtime data systems.” Manning, April 2015, ISBN 9781617290343. http://manning.com/marz/

5. Martin Kleppmann: “Designing data-intensive applications.” O’Reilly Media, to appear. http://dataintensive.net

6. Martin Kleppmann: “Moving faster with data streams: The rise of Samza at LinkedIn.” 14 July 2014. http://engineering.linkedin.com/stream-processing/moving-faster-data-streams-rise-samza-linkedin

7. Jay Kreps: “Why local state is a fundamental primitive in stream processing.” 31 July 2014. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html

8. Jay Kreps: “I ♥︎ Logs.” O’Reilly Media, September 2014. http://shop.oreilly.com/product/0636920034339.do

9. Praveen Neppalli Naga: “Real-time Analytics at Massive Scale with Pinot.” 29 September 2014. http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

10. Lili Wu, Sam Shah, Sean Choi, Mitul Tiwari, and Christian Posse: “The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop on Recommender Systems and the Social Web, October 2014. http://ls13-www.cs.uni-dortmund.de/homepage/rsweb2014/papers/rsweb2014_submission_3.pdf

Martin Kleppmann

October 30, 2015

Transcript

  12. { eventType: PageViewEvent, timestamp: 1413215518, viewerId: 1234, sessionId: fa1afe101234deadbeef, pageKey: profile-view, viewedProfileId: 4321, trackingKey: invitation-email, ... etc. metadata about what content was displayed... }
  21. { eventType: PageViewEvent, timestamp: 1413215518, viewerId: 1234, sessionId: fa1afe101234deadbeef, pageKey: profile-view, viewedProfileId: 4321, trackingKey: invitation-email, ... etc. metadata about what content was displayed... }
  36. { eventType: ProfileEditEvent, timestamp: 1413215518, profileId: 1234, old: { location: "London, UK", industry: "Financial Services" }, new: { location: "Budapest, Hungary", industry: "Software" } }
  48. https://github.com/ept/newsfeed

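The two event types shown in the slides lend themselves to the enrichment pattern mentioned in the abstract: profile-edit events keep a local copy of each profile up to date, and every page-view event is annotated with the current state of the profile that was viewed. The following is a minimal single-process sketch under stated assumptions. The event field names come from the slides, while `enrich_page_views`, the `viewedProfile` output field and the sample data are hypothetical and not part of the talk or of any framework's API.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def enrich_page_views(events: Iterable[Event]) -> Iterator[Event]:
    """Stream-table join: enrich page views with the viewed profile's latest fields."""
    # Local state: profileId -> most recently seen profile fields.
    profiles: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        if event["eventType"] == "ProfileEditEvent":
            # Apply the edit to the local copy of the profile.
            profiles.setdefault(event["profileId"], {}).update(event["new"])
        elif event["eventType"] == "PageViewEvent":
            profile = profiles.get(event["viewedProfileId"])
            # Emit a copy of the event with the profile attached
            # (None if no edit for that profile has been seen yet).
            yield {
                **event,
                "viewedProfile": dict(profile) if profile is not None else None,
            }


# Hypothetical sample data for illustration.
sample_stream = [
    {
        "eventType": "ProfileEditEvent",
        "timestamp": 1413215518,
        "profileId": 4321,
        "old": {"location": "London, UK", "industry": "Financial Services"},
        "new": {"location": "Budapest, Hungary", "industry": "Software"},
    },
    {"eventType": "PageViewEvent", "timestamp": 1413215520,
     "viewerId": 1234, "viewedProfileId": 4321},
    {"eventType": "PageViewEvent", "timestamp": 1413215521,
     "viewerId": 1234, "viewedProfileId": 9999},
]

enriched = list(enrich_page_views(sample_stream))
```

In a distributed setting, both streams would presumably need to be partitioned by profile ID so that an edit and the views of the same profile reach the same task, and the local state would need to survive restarts. Neither concern is handled here.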