Amsterdam Java User Group - "Why and when should we consider stream processing frameworks in our solutions"

Key Highlights:
🚀 Explore the fundamentals of Stream Processing.
🌐 Learn about the tools and frameworks available.
📈 Discover success stories and benefits of Stream Processing.
🤔 Understand when to consider Stream Processing for your projects.
🔍 Find valuable resources to kickstart your journey.

Soroosh Khodami

November 30, 2023

Transcript

  1. Agenda

    ▪ What is Stream Processing?
    ▪ Frameworks & Platforms
    ▪ Basic Concepts & Patterns
    ▪ Preview/Demo
    ▪ Benefits & Drawbacks + Considerations
    ▪ Use Cases for Different Industries
    ▪ How to Start?
  2. This Talk is For

    ▪ Software Developers
    ▪ Tech Leads / Software Architects
    ▪ Data Engineers / Data Scientists / AI Engineers
    ▪ Product Owners / Product Managers / Business Analysts
  3. $ whoami

    ▪ I’m Soroosh Khodami
    ▪ Solution Architect @ Rabobank via Code Nomads
    ▪ Worked with Stream Processing at Scale at Bol.com
    ▪ Software Architecture Enthusiast

    @SorooshKh | linkedin.com/in/sorooshkhodami/
    Slides & code repository link will be shared at the end
  4. Stream (Data) Processing

    Stream processing is a big data technique that focuses on continuously reading data, processing the data individually or joining it with related data sets in real-time or near real-time, and then sending the output to other applications, data-stores, or systems.
  5. Event Processing vs. Stream Processing

    Event Processing (Decision Making + Updating DB)
    ▪ Payment Result: { "eventId": "987654321", "eventType": "PaymentStatusUpdated", "timestamp": "2023-09-19T15:45:00Z", "paymentId": "123456789", "status": "Paid", "amount": 75.0, "orderId": "543210987" }

    Stream Processing (Enrichment / JOIN)
    ▪ Customer: { "id": "Cust123", "name": "Alice Smith", "email": "...", "city": "Los Angeles", "gender": "Female", "dateOfBirth": "1985-08-10" }
    ▪ Order: { "orderId": "Order789", "customerId": "Cust123", "orderTotal": 150.0, "orderDate": "2023-09-20T10:30:00Z", "productName": "Gadget", "productCategory": "Electronics" }
    ▪ OrderWithCustomerData: { "orderId": "Order456", "orderTotal": 100.0, "orderDate": "2023-09-19T16:00:00Z", "customerCity": "New York", "customerGender": "Female", "customerDateOfBirth": "1990-05-15", "productName": "Widget", "productCategory": "Electronics" }
  6. Stream Processing Universe 2023

    [Diagram: stream processing frameworks and platforms, grouped into "Code will be executed on a Runner" and "Standalone / Alongside other frameworks".]
  7. Bounded Stream / Unbounded Stream

    [Diagram: a timeline (Past, Now, Future) showing an unbounded stream that keeps running into the future, and two bounded streams (#1 and #2), each with a Start and an End.]
  8. Event Time & Processing Time

    [Diagram: six events (1 Login, 2 Search, 3 View, 4 View, 5 View, 6 Play) placed on an Event Time axis and on a Processing Time axis, each marked 1 to 7.]
  9. Delivery Guarantees

    ▪ At Most Once: messages can be lost, but never duplicated (fire & forget)
    ▪ At Least Once: messages are never lost, but can be duplicated
    ▪ Exactly Once: messages are delivered & processed exactly once

    Learn More (Important)
    ▪ Streaming Concepts - Exactly Once Fault Tolerance Guarantees - youtube.com/watch?v=9pRsewtSPkQ
    ▪ Rundown of Flink's Checkpoints - youtube.com/watch?v=hoLeQjoGBkQ
    ▪ Understanding exactly-once processing and windowing in streaming pipelines - youtube.com/watch?v=DraQGkARegE
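A minimal sketch of how a consumer can cope with at-least-once delivery: it remembers the event IDs it has already handled and skips redelivered duplicates. The class and method names are hypothetical, and the in-memory set is an assumption for illustration only; this is not the exactly-once mechanism of Flink or Beam, which rely on checkpointing and transactional sinks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical consumer for illustration; not part of any framework.
class IdempotentConsumer {
    // IDs of events that were already processed (unbounded here; a real system would expire them).
    private final Set<String> seenEventIds = new HashSet<>();
    private final List<String> processed = new ArrayList<>();

    // Returns true if the event was processed, false if it was a redelivered duplicate.
    boolean handle(String eventId, String payload) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery: skip the side effect
        }
        processed.add(payload);
        return true;
    }

    List<String> processed() {
        return processed;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.handle("987654321", "PaymentStatusUpdated:Paid");
        consumer.handle("987654321", "PaymentStatusUpdated:Paid"); // redelivery after a retry
        System.out.println(consumer.processed().size()); // prints 1
    }
}
```

The event ID plays the role of an idempotency key: the broker may deliver the same message twice, but the side effect happens once.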
  10. IoT Farm

    Context
    ▪ 1000+ sensors
    ▪ Multiple sensors per location
    ▪ Unreliable internet connection
    ▪ Large amount of continuous sensor data

    Requirements
    ▪ Aggregated sensor data per location
    ▪ Correct order of data
    ▪ No duplicates / double processing
  11. Operators & Transform: IoT Farm Example

    [Diagram: three sources (Read Soil Moisture Sensors, Read Optical Sensors, Read Temperature Sensors), each followed by Operator(s), then Filter Selected Locations, Join & Aggregate, and a Sink.]
  12. Operators & Transform

    Images from:
    ▪ http://ibmstreams.github.io/streamsx.documentation/docs/spl/quick-start/qs-2/
    ▪ Analyzing tweets using Cloud Dataflow pipeline templates - https://cloud.google.com/blog/products/gcp/analyzing-tweets-using-cloud-dataflow-pipeline-templates/
  13. Windowing

    • Divides an unbounded, continuous data stream into smaller, finite segments
    • Allows operations and calculations to be performed on manageable chunks of data
    • It’s not feasible to load or keep the entire stream in memory
    • Useful for analyzing data over specific time periods or fixed numbers of events

    [Diagram: a stream of numeric events over time with one highlighted Window of Data (Sum: 19, Count: 5).]

    Learn More
    ▪ Basics of Windowing - https://www.youtube.com/watch?v=oJ-LueBvOcM&t=1s
    ▪ Advanced Windowing Concepts - https://www.youtube.com/watch?v=MuFA6CSti6M
  14. Tumbling/Fixed Window

    No overlaps between window elements.

    ▪ Time Based Windows (5 seconds each): Sum: 11, Count: 4 | Sum: 19, Count: 5 | Sum: 5, Count: 2
    ▪ Size Based Windows (4 elements each): Sum: 11, Count: 4 | Sum: 17, Count: 4 | Sum: 13, Count: 4
    ▪ Time & Size Based Windows (5-second windows): Sum: 11, Count: 4 | Sum: 17, Count: 4 | Sum: 7, Count: 3
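As a rough illustration of a time-based tumbling window, the sketch below assigns each event to exactly one fixed-size window by rounding its timestamp down to a multiple of the window size, then keeps a sum and a count per window. It is plain Java with hypothetical names, not the Flink or Beam windowing API, and it assumes timestamps in milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not a framework API.
class TumblingWindows {
    // Returns windowStart -> {sum, count} for non-overlapping windows of windowSizeMs.
    static Map<Long, long[]> aggregate(long[] timestampsMs, long[] values, long windowSizeMs) {
        Map<Long, long[]> result = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            // Each event belongs to exactly one window: [windowStart, windowStart + windowSizeMs).
            long windowStart = timestampsMs[i] - Math.floorMod(timestampsMs[i], windowSizeMs);
            long[] agg = result.computeIfAbsent(windowStart, k -> new long[2]);
            agg[0] += values[i]; // sum
            agg[1]++;            // count
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0, 1000, 4999, 5000, 9000, 10000};
        long[] values = {5, 2, 3, 4, 4, 1};
        aggregate(timestamps, values, 5000).forEach((start, agg) ->
                System.out.println("window@" + start + " sum=" + agg[0] + " count=" + agg[1]));
    }
}
```

A size-based window would instead close after a fixed number of elements, regardless of how much time has passed.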
  15. Sliding Window

    Time based: last 10 seconds, evaluated every 5 seconds, with overlaps between windows.

    [Diagram: a stream of Success, WARN and Error events covered by overlapping windows (Window #1, #2, #3, ..., #N, #N+1).]
    ▪ Window #1: Success: 4, Warn: 0, Error: 0
    ▪ Window #2: Success: 3, Warn: 0, Error: 1
    ▪ Window #3: Success: 1, Warn: 2, Error: 1
    ▪ ...
    ▪ A later window: Success: 0, Warn: 0, Error: 4
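The "last 10 seconds, every 5 seconds" idea can be sketched as follows: because windows overlap, each event is counted in every window whose time range contains it (two windows when the size is twice the slide). This is a plain-Java sketch with hypothetical names, assuming millisecond timestamps and window starts aligned to multiples of the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not a framework API.
class SlidingWindows {
    // Returns windowStart -> (status -> count) for windows of sizeMs that start every slideMs.
    static Map<Long, Map<String, Integer>> countByStatus(
            long[] timestampsMs, String[] statuses, long sizeMs, long slideMs) {
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long t = timestampsMs[i];
            // Latest window start that is not after the event.
            long lastStart = t - Math.floorMod(t, slideMs);
            // Walk back over every window [start, start + sizeMs) that still contains t.
            for (long start = lastStart; start > t - sizeMs; start -= slideMs) {
                windows.computeIfAbsent(start, k -> new TreeMap<>())
                       .merge(statuses[i], 1, Integer::sum);
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] timestamps = {1000, 6000, 12000};
        String[] statuses = {"Success", "Error", "Success"};
        countByStatus(timestamps, statuses, 10_000, 5_000)
                .forEach((start, counts) -> System.out.println("window@" + start + " " + counts));
    }
}
```

Note that early events also fall into a window that starts before time 0, which is why a negative window start appears in the output.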
  16. Session Window

    Close the window based on gap duration = 10 sec.

    [Diagram: playback events (Play, Heartbeat, Seek, Pause) for User #1 and User #2 over time. Each user's events are split into Window #1 and Window #2 where a gap between events (10 sec for User #1, 20 sec for User #2) reaches the gap duration.]
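A small sketch of gap-based session windows for a single user: a new session starts whenever the time since the previous event is at least the gap duration. The names are hypothetical, timestamps are assumed to be sorted and in milliseconds, and the choice of "at least the gap" (rather than "more than the gap") is an assumption of this sketch, not a statement about any framework's boundary rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a framework API.
class SessionWindows {
    // Returns one {firstEventTime, lastEventTime, eventCount} entry per session.
    static List<long[]> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        long[] current = null;
        for (long t : sortedTimestampsMs) {
            // Start a new session on the first event or after an inactivity gap.
            if (current == null || t - current[1] >= gapMs) {
                current = new long[] {t, t, 0};
                result.add(current);
            }
            current[1] = t; // extend the session to this event
            current[2]++;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] userEvents = {0, 3000, 6000, 20000, 22000};
        for (long[] s : sessions(userEvents, 10_000)) {
            System.out.println("session " + s[0] + ".." + s[1] + " events=" + s[2]);
        }
    }
}
```

In a keyed stream this logic would run once per user, so one user's long pause does not close another user's session.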
  17. Watermarks

    [Diagram: events 1, 2, 3, 4, 7 assigned to two 5-second windows (Window #1, Window #2), shown twice; the second timeline includes an additional event 4.]

    Learn More
    ▪ Basics of Windowing - https://www.youtube.com/watch?v=oJ-LueBvOcM&t=1s
    ▪ Advanced Windowing Concepts - https://www.youtube.com/watch?v=MuFA6CSti6M
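One way to picture a watermark is as a moving estimate of "event time has progressed at least this far". The sketch below assumes a simple bounded-out-of-orderness strategy: the watermark trails the highest event time seen so far by a fixed allowance, and an event whose window already ended at or before the watermark is counted as late and dropped. All names are hypothetical; real frameworks offer richer options such as allowed lateness and side outputs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of watermark-based window closing; not a framework API.
class WatermarkedWindows {
    final Map<Long, Long> windowSums = new TreeMap<>();
    int lateEvents = 0;

    private final long windowSizeMs;
    private final long maxOutOfOrdernessMs;
    private long watermark = Long.MIN_VALUE;

    WatermarkedWindows(long windowSizeMs, long maxOutOfOrdernessMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Events are passed in arrival (processing) order; eventTimeMs is when they happened.
    void onEvent(long eventTimeMs, long value) {
        long windowStart = eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        if (windowEnd <= watermark) {
            lateEvents++; // the window was already closed by the watermark
        } else {
            windowSums.merge(windowStart, value, Long::sum);
        }
        // The watermark only moves forward, trailing the highest event time seen.
        watermark = Math.max(watermark, eventTimeMs - maxOutOfOrdernessMs);
    }

    public static void main(String[] args) {
        WatermarkedWindows windows = new WatermarkedWindows(5_000, 1_000);
        long[] arrivalOrder = {1000, 2000, 3000, 7000, 4000}; // 4000 arrives after 7000
        for (long t : arrivalOrder) {
            windows.onEvent(t, 1);
        }
        System.out.println(windows.windowSums + " late=" + windows.lateEvents);
    }
}
```

A larger out-of-orderness allowance accepts more stragglers but delays the results of each window.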
  18. Basic Concepts & Patterns

    ✓ Bounded Stream / Unbounded Stream
    ✓ Operators & Transforms
    ✓ Event Time & Processing Time
    ✓ Event Delivery Guarantee
    ✓ Windowing (Fixed, Sliding, Session, Watermark)
    ❑ States & Stateful Stream Processing
    ❑ Joining Streams & Enrichment Pattern
  19. Joining Streams & Enrichment Pattern

    [Diagram: a Temperature Sensor Stream and a Moisture Sensor Stream, each cut into windows, combined with a Window Inner Join and a Window Cross Join (CoGroup).]

    Learn More
    ▪ Stream Join in Flink: from Discrete to Continuous - Xingcan Cui - https://www.youtube.com/watch?v=3YVRluJUKIw
    ▪ Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf - https://www.youtube.com/watch?v=cJS18iKLUIY
  20. States & Stateful Stream Processing

    [Diagram: streams flowing through a chain of stateless operators and two stateful operators, each stateful operator attached to its own State.]

    Learn More
    ▪ Introduction to Stateful Stream Processing with Apache Flink - Robert Metzger - https://www.youtube.com/watch?v=DkNeyCW-eH0
    ▪ Webinar: Deep Dive on Apache Flink State - Seth Wiesman - https://www.youtube.com/watch?v=9GF8Hwqzwnk
  21. States & Stateful Stream Processing: Brute Force Login Monitoring

    [Diagram: Read Login Attempts → Group By IP → Windowing (Last 15 Minutes Count) → Filter Above Threshold → Enrich With Previous Breach and Update Last Breach (State: Last Threshold Breach, nullable) → Sink Security Alerts.]

    Learn More
    ▪ Introduction to Stateful Stream Processing with Apache Flink - Robert Metzger - https://www.youtube.com/watch?v=DkNeyCW-eH0
    ▪ Webinar: Deep Dive on Apache Flink State - Seth Wiesman - https://www.youtube.com/watch?v=9GF8Hwqzwnk
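The brute force login monitoring flow can be approximated in plain Java to show what "state" means here: per IP address, the operator remembers recent attempts and the time of the last threshold breach (absent until the first breach). This is a sketch under assumptions, with hypothetical names, an in-memory map standing in for keyed state, and timestamps assumed to arrive in order per IP; it is not the implementation from the talk's repository.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stateful operator for illustration; not a framework API.
class BruteForceMonitor {
    private static final long WINDOW_MS = 15 * 60 * 1000L; // last 15 minutes

    private final int threshold;
    // Keyed state: recent attempt timestamps per IP.
    private final Map<String, Deque<Long>> attemptsByIp = new HashMap<>();
    // Keyed state: time of the last threshold breach per IP (absent = never breached).
    private final Map<String, Long> lastBreachByIp = new HashMap<>();

    BruteForceMonitor(int threshold) {
        this.threshold = threshold;
    }

    // Returns an alert when the attempts in the last 15 minutes exceed the threshold.
    Optional<String> onLoginAttempt(String ip, long timestampMs) {
        Deque<Long> attempts = attemptsByIp.computeIfAbsent(ip, k -> new ArrayDeque<>());
        attempts.addLast(timestampMs);
        // Evict attempts that fell out of the 15-minute window.
        while (!attempts.isEmpty() && attempts.peekFirst() <= timestampMs - WINDOW_MS) {
            attempts.removeFirst();
        }
        if (attempts.size() <= threshold) {
            return Optional.empty();
        }
        // Enrich the alert with the previous breach and update the state.
        Long previousBreach = lastBreachByIp.put(ip, timestampMs);
        return Optional.of(ip + " attempts=" + attempts.size() + " previousBreach=" + previousBreach);
    }

    public static void main(String[] args) {
        BruteForceMonitor monitor = new BruteForceMonitor(2);
        for (long t = 0; t <= 3000; t += 1000) {
            monitor.onLoginAttempt("10.0.0.1", t).ifPresent(System.out::println);
        }
    }
}
```

In a stream processing framework the two maps would be managed, checkpointed keyed state rather than heap objects, which is what makes the operator fault tolerant.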
  22. Group By Key / KeyBy [4Geeks]

    [Diagram: a stream of playback events (Play, Heartbeat, Seek) regrouped once by action (Group By Action) and once by customer (Group By Customer).]

    Learn More
    ▪ Apache Flink Specifying Keys - https://medium.com/big-data-processing/apache-flink-specifying-keys-81b3b651469
    ▪ Branching & merging PCollections with Apache Beam - https://youtu.be/RYD40js20a4
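Grouping by key can be pictured with the Java Streams API on a small, bounded batch: the same events are partitioned once by customer and once by action. The helper name and the two-element array layout (customer, action) are assumptions of this sketch; a streaming framework applies the same idea continuously and routes each key to a dedicated partition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not a framework API.
class KeyByDemo {
    // Groups events by the field at keyField and collects the field at valueField per key.
    static Map<String, List<String>> groupBy(List<String[]> events, int keyField, int valueField) {
        return events.stream().collect(Collectors.groupingBy(
                e -> e[keyField],
                TreeMap::new,
                Collectors.mapping(e -> e[valueField], Collectors.toList())));
    }

    public static void main(String[] args) {
        // Each event is {customer, action}.
        List<String[]> events = Arrays.asList(
                new String[] {"user1", "Play"},
                new String[] {"user2", "Play"},
                new String[] {"user1", "Seek"});
        System.out.println(groupBy(events, 0, 1)); // grouped by customer
        System.out.println(groupBy(events, 1, 0)); // grouped by action
    }
}
```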
  23. Order Enrichment With Customer Data [4Geeks]

    Apache Beam + Dataflow vs Spring Boot + Redis

    [Diagram: Customers Events (CDC) and Orders Events feed Enrich Order Data, which produces Enriched Orders With Customer Data.]

    Code Repository & Slides: @SorooshKh
  24. Order Enrichment Test Results: Insights

    Apache Beam + Dataflow
    ▪ 1 Dataflow worker with default spec
    ▪ 120k messages processed in 3 minutes
    ▪ ~700 msg/second
    ▪ Higher costs for keeping the job running

    Spring Boot
    ▪ Tested on minimum Kubernetes hardware on GCP
    ▪ 120k messages processed in 5 minutes
    ▪ ~400 msg/second
    ▪ Lower costs for keeping the job running

    Note: the insights above are not derived from a fully accurate benchmark.
  25. Order Enrichment With Customer Data [4Geeks]: Spring Boot + Redis

    ▪ Customer CDC → Read → Store Customer in Redis
    ▪ Orders → Read → Get Customer Information from Redis → Enrich Order With Customer Data → Sink EnrichedOrder
  26. Order Enrichment With Customer Data [4Geeks]: Apache Beam + Dataflow

    ▪ Customer CDC → Read → KeyBy CustomerID → Update Customer in State (State: Customer)
    ▪ Orders → Read → KeyBy CustomerID
    ▪ CoGroupByKey → EnrichOrderWithCustomerData → Sink EnrichedOrder

    Example: Customer(123) is keyed as (123, Customer(123)); Order(1005, CustomerId=123) is keyed as (123, Order(1005, CustomerId=123)); the output is OrderWithCustomerData (Order + Customer).

    Learn More
    ▪ Stream Join in Flink: from Discrete to Continuous - Xingcan Cui - https://www.youtube.com/watch?v=3YVRluJUKIw
    ▪ Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf - https://www.youtube.com/watch?v=cJS18iKLUIY
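The enrichment pattern can be sketched without any framework: customer change events update state keyed by customer ID, and each order looks up its customer in that state. The class name, the map-based records and the field names (taken from the sample JSON on slide 5) are assumptions of this sketch; it is not the Apache Beam code from the talk's repository, and it simply skips orders whose customer has not been seen yet.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical enrichment operator for illustration; not the Beam or Flink API.
class OrderEnricher {
    // Keyed state: latest known customer record per customer ID.
    private final Map<String, Map<String, String>> customerState = new HashMap<>();

    // Customer CDC stream: keep only the newest version of each customer.
    void onCustomerChange(String customerId, Map<String, String> customer) {
        customerState.put(customerId, customer);
    }

    // Order stream: join the order with the customer held in state, if any.
    Optional<Map<String, String>> onOrder(Map<String, String> order) {
        Map<String, String> customer = customerState.get(order.get("customerId"));
        if (customer == null) {
            return Optional.empty(); // customer not seen yet; a real pipeline might buffer the order
        }
        Map<String, String> enriched = new LinkedHashMap<>(order);
        enriched.put("customerCity", customer.get("city"));
        enriched.put("customerGender", customer.get("gender"));
        return Optional.of(enriched);
    }

    public static void main(String[] args) {
        OrderEnricher enricher = new OrderEnricher();
        Map<String, String> customer = new HashMap<>();
        customer.put("city", "Los Angeles");
        customer.put("gender", "Female");
        enricher.onCustomerChange("Cust123", customer);

        Map<String, String> order = new LinkedHashMap<>();
        order.put("orderId", "Order789");
        order.put("customerId", "Cust123");
        System.out.println(enricher.onOrder(order));
    }
}
```

The Spring Boot + Redis variant follows the same shape, with Redis playing the role of the customer state.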
  27. Stream Processing Frameworks: Benefits & Drawbacks

    Benefits
    ✓ Fast & High-Throughput
    ✓ Easy to Scale
    ✓ Exactly Once Processing / Fault Tolerant
    ✓ Customizable
    ✓ Advanced features at scale: Windowing, Watermarks, Stateful Functions and more

    Drawbacks
    ✖ Complexity
    ✖ Implementation & Maintenance
    ✖ Testing & debugging is challenging
    ✖ Changing data pipelines is hard
    ✖ Error handling is not simple
    ✖ Data consistency is not easy
  28. Stream Data Integration vs Stream Analytics

    Stream Data Integration (Stream ETL)
    ▪ Reading Input
    ▪ Map
    ▪ Filter
    ▪ Simple Enrich

    Stream Analytics
    ▪ Stateful Processing
    ▪ Pattern Matching
    ▪ Complex Joins / Aggregations

    Learn More
    ▪ Stream Processing – Concepts and Frameworks (Guido Schmutz, Switzerland) - https://www.youtube.com/watch?v=vFshGQ2ndeg | https://www.slideshare.net/gschmutz/introduction-to-stream-processing-132881199
  29. Considerations

    ▪ Learning Curve: Stream Data Integration 1 – 2 weeks; Stream Analytics 2 – 3 months
    ▪ Project Timeline: 3 – 4 engineers, 4 – 6 months (0 -> Stability)
    ▪ Hard to Find Developers
    ▪ Limited Docs/Resources
    ▪ Community Support
    ▪ Costs
    ▪ Cloud Providers Help a Bit

    Learn More (Important)
    ▪ Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ
  30. Decision Making Factors

    ▪ Requirements (FRs + NFRs + Roadmap)
    ▪ Development Cost (Capex)
    ▪ Maintenance Cost (Opex)
    ▪ Complexity
    ▪ Limitations
    ▪ Industry Best Practices
  31. When should we consider it in our solutions?

    Case: Stream Data Integration
    Context / Conditions
  32. When should we consider it in our solutions?

    Case: Stream Data Integration
    Context / Conditions
    • Events / second < 1K
    • Experience with stream processing: No
    • Business queries are changing frequently
    • Time to market: Very tight
    • 3 – 4 Mid-Senior Developers

    Learn More
    ▪ Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ

    Note: The cases in this presentation are designed to demonstrate the reasoning process.
  33. When should we consider it in our solutions?

    Case: Stream Analytics
    Context / Conditions
    • Events / second > 10K
    • Experience with stream processing: No
    • Business queries are clear and not changing frequently
    • Real-time / near real-time insights are crucial: Yes
    • 3 – 4 Mid-Senior Developers

    Learn More
    ▪ Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ

    Note: The cases in this presentation are designed to demonstrate the reasoning process.
  34. Use Cases

    ▪ Video Streaming: Playback Analytics
    ▪ IoT: GPS Tracking
    ▪ Telecom: Billing / Charging System
    ▪ Finance: Fraud Detection
    ▪ E-Commerce: User Analytics
    ▪ Gaming Industry: Anti-Cheat
  35. Video Platforms Use Cases

    ▪ Playback Analytics
    ▪ Content Provider Shares
    ▪ Pay Per Minute
    ▪ Fraud Detection
    ▪ Personalized Recommendation

    Learn More
    ▪ Massive Scale Data Processing at Netflix using Flink - Snehal Nagmote & Pallavi Phadnis - youtube.com/watch?v=lC0d3gAPXaI
    ▪ Custom, Complex Windows at Scale using Apache Flink - Matt Zimmer (Netflix) - youtube.com/watch?v=XUvqnsWm8yo
    ▪ SF 2017: Monal Daxini - Stream Processing with Flink at Netflix - youtube.com/watch?v=sPB8w-YXX1s
    ▪ Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow - youtube.com/watch?v=o4C7TDneH00
  36. Gaming Industry Use Cases

    Game Telemetry Analytics · Rewards (In-Game) · Live In-Game Changes (NPC, Quests, ..) · IoT Integration · Loyalty Service · Anti-Cheat · Chat Service · Monitoring · Match Making · Payment Fraud Detection · In-Game Recommendation · Advertiseme · AI Training · Payment

    Learn More
    ▪ Kafka and Big Data Streaming Use Cases in the Gaming Industry - https://www.confluent.io/online-talks/kafka-and-big-data-streaming-use-cases-in-the-gaming-industry/
    ▪ Let's Play Flink – Fun with Streaming in a Gaming Company - https://www.youtube.com/watch?v=8BNKEmt47UM
  37. Application Analytics Use Cases

    Learn More
    ▪ Implementing Google Analytics: A Case Study - Making Sense of Stream Processing by Martin Kleppmann - https://www.oreilly.com/library/view/making-sense-of/9781492042563/ch01.html
    ▪ Martin Kleppmann - Event Sourcing and Stream Processing at Scale - https://www.youtube.com/watch?v=avi-TZI9t2I
    ▪ Singles Day 2018: Data in a Flink of an eye - https://www.ververica.com/blog/singles-day-2018-data-in-a-flink-of-an-eye
  38. Internet Of Things Use Cases

    ▪ Fleet management / GPS Tracking
    ▪ Anomaly detection
    ▪ Smart home automation
    ▪ Energy management
    ▪ Environmental monitoring
    ▪ Predictive maintenance
    ▪ Self-Driving Cars

    Learn More
    ▪ 7 Reasons to use Apache Flink for your IoT Project - https://www.youtube.com/watch?v=Q0LBTmT4W9o
  39. Telecommunication Use Cases

    ▪ Billing
    ▪ Network Optimization
    ▪ Security
    ▪ Fraud Detection

    Learn More
    ▪ Maciej Próchniak - Stream processing in telco - case study based on Apache Flink & TouK Nussknacker @ Devoxx Poland - https://www.youtube.com/watch?v=WLfEB__fM-4
  40. Financial Systems Use Cases

    ▪ Fraud detection
    ▪ Algorithmic trading
    ▪ Risk management
    ▪ Real-time portfolio analysis
    ▪ Customer analytics
    ▪ Regulatory compliance
    ▪ Profit & Loss Insights

    Learn More
    ▪ Real Time Fraud Detection with Stateful Functions - https://www.youtube.com/watch?v=RxDlksbsdQ0
    ▪ Fast Data at ING - Martijn Visser & Bas Geerdink (ING) - https://www.youtube.com/watch?v=e-_6gijUGAw
    ▪ Stream ING Models – Real time model deployment of ML Capabilities - https://www.youtube.com/watch?v=Do7C4UJyWCM
  41. How to start learning?

    1. Google Cloud Apache Beam - Debi Cabrera [1]
    2. Apache Beam Step By Step - Atul Raina [2]
    3. Beam Summit & Flink Forward [3]
    4. Official Documentation [4]

    IMPORTANT NOTE: Creating a stream processing service isn't as straightforward as crafting CRUD APIs. Relying solely on Google, development tools, Stack Overflow, and copy-pasting won't get you far. It's crucial to dedicate ample time to thoroughly learn and understand the underlying concepts.

    [1] https://youtu.be/65lmwL7rSy4
    [2] https://youtube.com/playlist?list=PL8bzd7vku-WhVHzJgmXoCxx3aB4PxTQLP
    [3] https://beamsummit.org/ | https://www.flink-forward.org/
    [4] https://beam.apache.org/documentation/ | https://nightlies.apache.org/flink/flink-docs-stable/
  42. Slides & Code Repository

    Any questions? Send me a message on Twitter or LinkedIn.
    Thanks for your attention!
    @SorooshKh | linkedin.com/in/sorooshkhodami/