
[Philly JUG] Divide, Distribute and Conquer: Stream v. Batch

Viktor Gamov
September 13, 2017

Data is flowing everywhere around us: from phones, credit cards, sensor-equipped buildings, vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and video, and so on. Simple data collection is not enough anymore. Most current systems process data via nightly extract, transform, and load (ETL) operations. This approach is common in enterprise environments, but it requires decision makers to wait an entire day (or night) for reports to become available.

But businesses don’t want «Big Data» anymore. They want «Fast Data». What distinguishes «streaming systems» from batch systems is that the event stream is unbounded, or “infinite”, from a system perspective.

Decision-makers need to analyze these streaming events as a whole to make business decisions as new information arrives. In this talk, after a short introduction to common approaches and architectures (lambda, kappa), Viktor will demonstrate how to use open-source stream processing tools (Flink, Kafka Streams, Hazelcast Jet).

Transcript

  1. @gamussa @confluentinc @thephillyjug
    Divide, Distribute and Conquer:

    Stream v. Batch


  2. Stream v. Batch


  3. Solutions Architect
    Who am I?


  4. Solutions Architect
    Developer Advocate
    Who am I?


  5. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Who am I?


  6. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Hey you, yes, you, go follow me in twitter ©
    Who am I?


  7. @gamussa @confluentinc @thephillyjug
    Disclaimer:



  8. @gamussa @confluentinc @thephillyjug
    BATCH PROCESSING
    Data at rest


  9. @gamussa @confluentinc @thephillyjug
    Data and Queries
    Origin and processing


  10. @gamussa @confluentinc @thephillyjug


  11. @gamussa @confluentinc @thephillyjug
    Data…


  12. @gamussa @confluentinc @thephillyjug
    Data…


  13. @gamussa @confluentinc @thephillyjug
    Data…
    ✓ … inherently immutable
    ✓ … time-based


  14. @gamussa @confluentinc @thephillyjug
    CRUD -> CR


  15. @gamussa @confluentinc @thephillyjug
    Processing is a query


  16. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set


  17. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection


  18. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection
    Aggregations


  19. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection
    Aggregations
    Joins


  20. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/07/2017"
    GROUP BY user_vote;


  21. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/08/2017"
    GROUP BY user_vote;


  22. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/08/2017"
    GROUP BY user_vote;
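
The "processing is a query" idea on the preceding slides can be sketched in plain Java: the whole data set is at rest, and the job is a function over it (filter, then aggregate). This is only an illustration, not code from the talk. The `AccessLogRow` class, the `countVotes` method, and the sample rows are hypothetical; only the column names come from the query above.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical batch job: the full AccessLog is available before the query runs.
class BatchVoteCount {

    // One row of the assumed AccessLog table.
    static final class AccessLogRow {
        final LocalDate eventDate;
        final String userVote;

        AccessLogRow(LocalDate eventDate, String userVote) {
            this.eventDate = eventDate;
            this.userVote = userVote;
        }
    }

    // Mirrors: SELECT user_vote, count(*) FROM AccessLog
    //          WHERE event_date BETWEEN from AND to GROUP BY user_vote
    static Map<String, Long> countVotes(List<AccessLogRow> log, LocalDate from, LocalDate to) {
        return log.stream()
                // BETWEEN is inclusive on both ends
                .filter(r -> !r.eventDate.isBefore(from) && !r.eventDate.isAfter(to))
                // GROUP BY user_vote with count(*); TreeMap keeps keys sorted for printing
                .collect(Collectors.groupingBy(r -> r.userVote, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<AccessLogRow> log = Arrays.asList(
                new AccessLogRow(LocalDate.of(2017, 4, 7), "yes"),
                new AccessLogRow(LocalDate.of(2017, 4, 7), "no"),
                new AccessLogRow(LocalDate.of(2017, 4, 8), "yes"),
                new AccessLogRow(LocalDate.of(2017, 4, 9), "yes"));
        // prints {no=1, yes=2}
        System.out.println(countVotes(log, LocalDate.of(2017, 4, 7), LocalDate.of(2017, 4, 8)));
    }
}
```

Widening the date range means re-running the function over the full data set, which is the cost that batch processing accepts.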


  23. @gamussa @confluentinc @thephillyjug
    Lambda architecture origins
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html


  24. @gamussa @confluentinc @thephillyjug
    Lambda Architecture


  25. @gamussa @confluentinc @thephillyjug
    TFW Trying to explain modern big data
    landscape


  26. @gamussa @confluentinc @thephillyjug
    Precomputed Results
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html


  27. @gamussa @confluentinc @thephillyjug
    Batch Process
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
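
As a rough sketch of how precomputed results and a batch process fit together in a lambda-style design: a query reads the precomputed batch view and adds whatever has arrived since the last batch run. The class name, method name, and sample counts below are hypothetical and are not taken from the talk or from the linked post.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical serving-layer query for a lambda-style architecture.
class LambdaQuery {

    // batchView:    counts precomputed by the last batch run over all historical data
    // realtimeView: counts for events that arrived after that batch run started
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> result = new TreeMap<>(batchView);
        // Add the recent increments on top of the precomputed totals.
        realtimeView.forEach((key, count) -> result.merge(key, count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new TreeMap<>();
        batchView.put("yes", 100L);
        batchView.put("no", 40L);

        Map<String, Long> realtimeView = new TreeMap<>();
        realtimeView.put("yes", 3L);
        realtimeView.put("maybe", 1L);

        // prints {maybe=1, no=40, yes=103}
        System.out.println(merge(batchView, realtimeView));
    }
}
```

When the next batch run finishes, its view replaces the old one and the realtime view can be reset, so errors in the realtime layer are eventually overwritten.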


  28. @gamussa @confluentinc @thephillyjug
    STREAM PROCESSING
    Data in motion


  29. @gamussa @confluentinc @thephillyjug
    Streaming Platform


  30. @gamussa @confluentinc @thephillyjug
    Streaming Platform


  31. @gamussa @confluentinc @thephillyjug
    Directed Acyclic Graph


  32. @gamussa @confluentinc @thephillyjug
    DEMO


  33. @gamussa @confluentinc @thephillyjug
    DEMO


  34. @gamussa @confluentinc @thephillyjug
    Interesting cases
    Before You Go


  35. I FOUND YOUR LACK OF FAULT TOLERANCE
    DISTURBING


  36. Data is too important to
    store it in one computer


  37. @gamussa @confluentinc @thephillyjug
    How to process
    «infinite» data?


  38. @gamussa @confluentinc @thephillyjug
    Time model


  39. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics


  40. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics
    Majority of use cases require event-time semantics


  41. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics
    Majority of use cases require event-time semantics
    Other use cases may require processing-time or special variants like ingestion-time
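
To make the three notions of time concrete, here is a minimal sketch that assumes each event carries the timestamp set by its producer and the timestamp at which the platform received it. The `Event` class, the `TimeModel` enum, and `timestampOf` are hypothetical names for illustration, not the API of any of the tools mentioned in the talk.

```java
import java.time.Instant;

// Hypothetical illustration of event-time, ingestion-time, and processing-time.
class TimeSemantics {

    enum TimeModel { EVENT_TIME, INGESTION_TIME, PROCESSING_TIME }

    static final class Event {
        final String payload;
        final Instant createdAt;  // event time: when it happened at the source
        final Instant ingestedAt; // ingestion time: when the platform stored it

        Event(String payload, Instant createdAt, Instant ingestedAt) {
            this.payload = payload;
            this.createdAt = createdAt;
            this.ingestedAt = ingestedAt;
        }
    }

    // Picks the timestamp the computation should use under the chosen time model.
    // 'now' stands for the wall clock of the processing node.
    static Instant timestampOf(Event event, TimeModel model, Instant now) {
        switch (model) {
            case EVENT_TIME:
                return event.createdAt;
            case INGESTION_TIME:
                return event.ingestedAt;
            default:
                return now; // processing time ignores both embedded timestamps
        }
    }

    public static void main(String[] args) {
        Event email = new Event("email",
                Instant.parse("2017-09-13T08:00:00Z"),  // written offline
                Instant.parse("2017-09-13T18:00:00Z")); // delivered ten hours later
        Instant now = Instant.parse("2017-09-13T18:00:05Z");
        for (TimeModel model : TimeModel.values()) {
            System.out.println(model + " -> " + timestampOf(email, model, now));
        }
    }
}
```

The same event lands in a different hour depending on the model, which is why the choice has to follow the use case.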


  42. @gamussa @confluentinc @thephillyjug
    Time Model


  43. @gamussa @confluentinc @thephillyjug
    Time Model


  44. @gamussa @confluentinc @thephillyjug
    Time Model


  45. Finite
    Representation
    Of
    Infinite
    Data


  46. @gamussa @confluentinc @thephillyjug
    Windowing
    Windowing is an operation that groups
    events


  47. @gamussa @confluentinc @thephillyjug
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101


  48. @gamussa @confluentinc @thephillyjug
    Windowing
    Input data, where colors represent different users' events
    Rectangles denote different event-time windows
    [Figure: event-time windowing; axes: processing-time and event-time; users: alice, bob, dave]
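
A small sketch of what the figure describes, assuming fixed-size (tumbling) windows and non-negative millisecond timestamps: each event is assigned to a window by its event time, per user, regardless of the order in which it arrives. The class and method names, the window size, and the sample timestamps are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-user tumbling-window counter keyed by event time.
class TumblingWindows {

    // Start of the fixed-size window that contains the given event time.
    static long windowStart(long eventTimeMs, long windowSizeMs) {
        return eventTimeMs - (eventTimeMs % windowSizeMs);
    }

    // Counts events per "user@windowStart". Arrival order does not matter,
    // because the window is derived from the event's own timestamp.
    static Map<String, Long> countPerUserWindow(String[] users, long[] eventTimesMs, long windowSizeMs) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < users.length; i++) {
            String key = users[i] + "@" + windowStart(eventTimesMs[i], windowSizeMs);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] users = {"alice", "bob", "alice", "dave", "alice"};
        // The last alice event arrives late but belongs to the first window.
        long[] eventTimesMs = {1_000L, 2_000L, 65_000L, 61_000L, 4_000L};
        // prints {alice@0=2, alice@60000=1, bob@0=1, dave@60000=1}
        System.out.println(countPerUserWindow(users, eventTimesMs, 60_000L));
    }
}
```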


  49. @gamussa @confluentinc @thephillyjug
    Windowing
    Windowing is an operation that groups
    events
    Most commonly needed: time windows,
    session windows
    Examples:
    ✗ Real-time monitoring: 5-minute averages
    ✗ Reader behavior on a website: user browsing sessions
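
Session windows differ from fixed time windows in that their boundaries come from the data: a session ends after a period of inactivity. The sketch below groups one reader's page-view times into sessions. The class name, the method name, and the 30-minute gap are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical session-window grouping for a single user's events.
class SessionWindows {

    // Splits event times into sessions; a new session starts whenever the gap
    // to the previous event is larger than gapMs.
    static List<List<Long>> sessions(long[] eventTimesMs, long gapMs) {
        long[] sorted = eventTimesMs.clone();
        Arrays.sort(sorted); // order by event time, not by arrival
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        for (long t : sorted) {
            if (current == null || t - current.get(current.size() - 1) > gapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Page views at minutes 0, 5, 12, then a long pause, then 50 and 55.
        long[] views = {0, 5 * minute, 12 * minute, 50 * minute, 55 * minute};
        // Two sessions with a 30-minute inactivity gap.
        System.out.println(sessions(views, 30 * minute));
    }
}
```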


  50. @gamussa @confluentinc @thephillyjug
    Fatality


  51. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Is very common in practice, not a rare
    corner case
    ✗ Related to time model discussion


  52. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data


  53. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter

    airplane, lose Internet connectivity


  54. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter

    airplane, lose Internet connectivity
    Emails are being written

    during the 10h flight


  55. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter

    airplane, lose Internet connectivity
    Emails are being written

    during the 10h flight
    Internet connectivity is restored,

    phones will send queued emails now


  56. @gamussa @confluentinc @thephillyjug
    Stream Processing: results


  57. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time


  58. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time
    • Windows – finite view of infinite data
    • Based on temporal characteristics of the event


  59. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time
    • Windows – finite view of infinite data
    • Based on temporal characteristics of the event
    • Late event processing
    • You choose how long to wait
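
"You choose how long to wait" can be sketched as a window counter with a configurable allowed lateness: a late event still updates its window as long as the stream has not advanced past the window end plus that allowance. This is a simplified model, not the implementation of any particular engine. The class name, its methods, and the use of the highest event time seen as the stream clock are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tumbling-window counter that tolerates late events up to a limit.
class LateEventCounter {
    private final long windowSizeMs;
    private final long allowedLatenessMs;
    private long maxEventTimeSeen = Long.MIN_VALUE;          // crude "stream time"
    private final Map<Long, Long> counts = new TreeMap<>();  // window start -> count
    private final List<Long> dropped = new ArrayList<>();    // events that came too late

    LateEventCounter(long windowSizeMs, long allowedLatenessMs) {
        this.windowSizeMs = windowSizeMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    // Returns true if the event was counted, false if it arrived too late.
    boolean accept(long eventTimeMs) {
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMs);
        long start = eventTimeMs - (eventTimeMs % windowSizeMs);
        long closesAt = start + windowSizeMs + allowedLatenessMs;
        if (closesAt <= maxEventTimeSeen) {
            dropped.add(eventTimeMs); // the window is already final
            return false;
        }
        counts.merge(start, 1L, Long::sum);
        return true;
    }

    Map<Long, Long> counts() { return counts; }

    List<Long> dropped() { return dropped; }

    public static void main(String[] args) {
        LateEventCounter counter = new LateEventCounter(60_000L, 30_000L);
        long[] arrivals = {10_000L, 70_000L, 50_000L, 130_000L, 20_000L};
        for (long t : arrivals) {
            counter.accept(t);
        }
        System.out.println(counter.counts());  // prints {0=2, 60000=1, 120000=1}
        System.out.println(counter.dropped()); // prints [20000]
    }
}
```

A longer allowance gives more complete results but delays the point at which a window can be treated as final, so the setting is a trade-off between completeness and latency.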


  60. @gamussa @confluentinc @thephillyjug
    https://github.com/confluentinc/kafka-streams-examples


  61. @gamussa @confluentinc @thephillyjug
    Thanks!
    questions?
    @gamussa
    [email protected]
