[Philly JUG] Divide, Distribute and Conquer: Stream v. Batch

Viktor Gamov
September 13, 2017

Data is flowing everywhere around us: from phones, credit cards, sensor-equipped buildings, vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and video, and so on. Simple data collection is not enough anymore. Most current systems process data via nightly extract, transform, and load (ETL) operations. This approach is common in enterprise environments, but it requires decision makers to wait an entire day (or night) for reports to become available.

But businesses don’t want «Big Data» anymore. They want «Fast Data». What distinguishes «streaming systems» from batch systems is that the event stream is unbounded, or “infinite”, from a system perspective.

Decision makers need to analyze these streaming events as a whole to make business decisions as new information arrives. In this talk, after a short introduction to common approaches and architectures (lambda, kappa), Viktor will demonstrate how to use open-source stream processing tools (Flink, Kafka Streams, Hazelcast Jet).

Transcript

  1. @gamussa @confluentinc @thephillyjug
    Divide, Distribute and Conquer:

    Stream v. Batch

    View Slide

  2. Stream v. Batch

    View Slide

  3. Who am I?

    View Slide

  4. Solutions Architect
    Who am I?

    View Slide

  5. Solutions Architect
    Developer Advocate
    Who am I?

    View Slide

  6. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Who am I?

    View Slide

  7. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Hey you, yes, you, go follow me in twitter ©
    Who am I?

    View Slide

  8. @gamussa @confluentinc @thephillyjug
    Disclaimer:


    View Slide

  9. @gamussa @confluentinc @thephillyjug
    BATCH PROCESSING
    Data at rest

    View Slide

  10. @gamussa @confluentinc @thephillyjug
    Data and Queries
    Origin and processing

    View Slide

  11. @gamussa @confluentinc @thephillyjug

    View Slide

  12. @gamussa @confluentinc @thephillyjug
    Data…

    View Slide

  13. @gamussa @confluentinc @thephillyjug
    Data…

    View Slide

  14. @gamussa @confluentinc @thephillyjug
    ✓ … inherently immutable
    Data…
    ✓ … time-based

    View Slide

  15. @gamussa @confluentinc @thephillyjug
    CRUD -> CR

    View Slide
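
The step from CRUD to CR (create and read only) can be illustrated with an append-only log. The following is a minimal sketch under stated assumptions, not code from the talk: `VoteLog`, `VoteEvent`, and their fields are hypothetical names. Events are immutable and time-stamped, the log offers no update or delete, and the current state is derived by reading the log.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names for illustration; not taken from the talk.
final class VoteLog {
    // An event is an immutable fact stamped with the time it happened.
    record VoteEvent(String userId, String vote, Instant eventTime) {}

    private final List<VoteEvent> events = new ArrayList<>();

    // "Create": the only write operation is appending a new fact.
    void append(VoteEvent event) {
        events.add(event);
    }

    // "Read": callers get a read-only view, so history cannot be edited.
    List<VoteEvent> readAll() {
        return Collections.unmodifiableList(events);
    }

    // Current state is derived from the log: the most recently appended vote wins.
    String currentVote(String userId) {
        String latest = null;
        for (VoteEvent e : events) {
            if (e.userId().equals(userId)) {
                latest = e.vote();
            }
        }
        return latest;
    }
}
```

A change of mind is recorded as a second event rather than as an update, so both the old and the new vote stay available for later queries.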

  16. @gamussa @confluentinc @thephillyjug
    Processing is a query

    View Slide

  17. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set

    View Slide

  18. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection

    View Slide

  19. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection
    Aggregations

    View Slide

  20. @gamussa @confluentinc @thephillyjug
    Processing is a query
    Function on full data set
    Projection
    Aggregations
    Joins

    View Slide

  21. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/07/2017"
    GROUP BY user_vote;

    View Slide

  22. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/08/2017"
    GROUP BY user_vote;

    View Slide

  23. SELECT
    user_vote, count(*)
    FROM AccessLog
    WHERE event_date
    BETWEEN "04/07/2017" AND "04/08/2017"
    GROUP BY user_vote;

    View Slide
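
The queries above treat processing as a function over the full data set. As a rough Java equivalent, assuming the whole `AccessLog` table fits in a list, the same filter and `GROUP BY` can be written with the standard Streams API. `BatchVoteCount` and `AccessLogRow` are hypothetical names; the two fields mirror the columns used in the SQL.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical names for illustration; not taken from the talk.
final class BatchVoteCount {
    // One row of the AccessLog table; fields mirror the SQL columns.
    record AccessLogRow(String userVote, LocalDate eventDate) {}

    // Counts rows per vote for eventDate in [from, to] (inclusive),
    // like WHERE ... BETWEEN ... GROUP BY user_vote.
    static Map<String, Long> countVotes(List<AccessLogRow> table, LocalDate from, LocalDate to) {
        return table.stream()
                .filter(r -> !r.eventDate().isBefore(from) && !r.eventDate().isAfter(to))
                .collect(Collectors.groupingBy(
                        AccessLogRow::userVote, TreeMap::new, Collectors.counting()));
    }
}
```

The result is only as fresh as the last time the function ran over the table, which is the limitation the rest of the talk addresses.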

  24. @gamussa @confluentinc @thephillyjug
    Lambda architecture origins
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

    View Slide

  25. View Slide

  26. @gamussa @confluentinc @thephillyjug
    Lambda Architecture

    View Slide

  27. View Slide

  28. @gamussa @confluentinc @thephillyjug
    TFW Trying to explain modern big data
    landscape

    View Slide

  29. @gamussa @confluentinc @thephillyjug
    Precomputed Results
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

    View Slide

  30. @gamussa @confluentinc @thephillyjug
    Batch Process
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

    View Slide
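
In the lambda architecture described in the linked post, a query combines results precomputed by the batch process with results for data that arrived after the last batch run. The sketch below shows only that merge step for simple counts. `LambdaQuery`, `batchView`, and `realtimeView` are hypothetical names, and a real system would read both views from separate stores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names for illustration; not taken from the talk.
final class LambdaQuery {
    // Answers a query by adding the counts from the real-time view
    // (events since the last batch run) to the precomputed batch counts.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> result = new HashMap<>(batchView);
        realtimeView.forEach((key, count) -> result.merge(key, count, Long::sum));
        return result;
    }
}
```

The merge is this simple only because counts are additive; aggregations that cannot be combined this way need more care.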

  31. @gamussa @confluentinc @thephillyjug
    STREAM PROCESSING
    Data in motion

    View Slide

  32. @gamussa @confluentinc @thephillyjug
    Streaming Platform

    View Slide

  33. @gamussa @confluentinc @thephillyjug
    Streaming Platform

    View Slide

  34. @gamussa @confluentinc @thephillyjug
    Directed Acyclic Graph

    View Slide

  35. @gamussa @confluentinc @thephillyjug
    DEMO

    View Slide

  36. @gamussa @confluentinc @thephillyjug
    DEMO

    View Slide

  37. @gamussa @confluentinc @thephillyjug
    Interesting cases
    Before You Go

    View Slide

  38. I FOUND YOUR LACK OF FAULT TOLERANCE
    DISTURBING

    View Slide

  39. Data is too important to
    store it in one computer

    View Slide

  40. View Slide

  41. View Slide

  42. View Slide

  43. View Slide

  44. @gamussa @confluentinc @thephillyjug
    How to process
    «infinite» data?

    View Slide

  45. @gamussa @confluentinc @thephillyjug
    Time model

    View Slide

  46. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics

    View Slide

  47. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics
    Majority of use cases require event-time semantics

    View Slide

  48. @gamussa @confluentinc @thephillyjug
    Time model
    Different use cases require different time semantics
    Majority of use cases require event-time semantics
    Other use cases may require processing-time or special variants like ingestion-time

    View Slide
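
The three time semantics differ only in which timestamp is attached to an event. As a minimal sketch with hypothetical names (`TimeSemantics`, `Event`, `timestampOf`), assuming each event carries the time it happened and the time the platform stored it:

```java
import java.time.Instant;

// Hypothetical names for illustration; not taken from the talk.
final class TimeSemantics {
    enum TimeModel { EVENT_TIME, INGESTION_TIME, PROCESSING_TIME }

    // eventTime: when it happened at the source.
    // ingestionTime: when the streaming platform stored it.
    record Event(String key, Instant eventTime, Instant ingestionTime) {}

    // Picks the timestamp that windows and aggregations would be based on.
    static Instant timestampOf(Event event, TimeModel model, Instant now) {
        switch (model) {
            case EVENT_TIME:
                return event.eventTime();
            case INGESTION_TIME:
                return event.ingestionTime();
            default:
                return now; // processing time: the processor's own clock
        }
    }
}
```

With event time, replaying the same data later yields the same timestamps and therefore the same windows; with processing time it generally does not.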

  49. @gamussa @confluentinc @thephillyjug
    Time Model

    View Slide

  50. @gamussa @confluentinc @thephillyjug
    Time Model

    View Slide

  51. @gamussa @confluentinc @thephillyjug
    Time Model

    View Slide

  52. Finite
    Representation
    Of
    Infinite
    Data

    View Slide

  53. @gamussa @confluentinc @thephillyjug
    Windowing
    Windowing is an operation that groups
    events

    View Slide
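
One simple way to group events is a fixed-size (tumbling) time window: every event falls into exactly one window, determined by its timestamp. The sketch below is plain Java under stated assumptions, not the API of any of the tools in the talk. `TumblingWindows` and `Reading` are hypothetical names, and the input is a finite list standing in for a slice of the stream.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical names for illustration; not taken from the talk.
final class TumblingWindows {
    record Reading(long eventTimeMillis, double value) {}

    // Start of the fixed-size window that contains the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Average value per window, keyed by window start in ascending order.
    static Map<Long, Double> averagePerWindow(List<Reading> readings, long sizeMillis) {
        return readings.stream().collect(Collectors.groupingBy(
                r -> windowStart(r.eventTimeMillis(), sizeMillis),
                TreeMap::new,
                Collectors.averagingDouble(Reading::value)));
    }
}
```

Because the window is derived from the event's own timestamp, the grouping does not depend on the order in which events arrive.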

  54. @gamussa @confluentinc @thephillyjug
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

    View Slide

  55. @gamussa @confluentinc @thephillyjug
    Windowing
    Input data, where colors represent different users' events
    Rectangles denote different event-time windows
    [Figure: windowing of events from alice, bob, and dave; axes: processing-time, event-time]

    View Slide

  56. @gamussa @confluentinc @thephillyjug
    Windowing
    Windowing is an operation that groups
    events
    Most commonly needed: time windows,
    session windows
    Examples:
    ✗ Real-time monitoring: 5-minute averages
    ✗ Reader behavior on a website: user browsing sessions

    View Slide
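
Session windows, unlike time windows, have no fixed size: a session ends when a user has been inactive for longer than a chosen gap. The following is a minimal sketch for a single user, assuming the timestamps are already sorted in ascending order. `SessionWindows`, `Session`, and `gapMillis` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration; not taken from the talk.
final class SessionWindows {
    // A session spans [startMillis, endMillis] and counts the events inside it.
    record Session(long startMillis, long endMillis, int eventCount) {}

    // Splits one user's event timestamps (sorted ascending) into sessions:
    // a new session starts when the gap to the previous event exceeds gapMillis.
    static List<Session> sessionize(List<Long> sortedTimestamps, long gapMillis) {
        List<Session> sessions = new ArrayList<>();
        if (sortedTimestamps.isEmpty()) {
            return sessions;
        }
        long start = sortedTimestamps.get(0);
        long last = start;
        int count = 0;
        for (long ts : sortedTimestamps) {
            if (ts - last > gapMillis) {
                sessions.add(new Session(start, last, count)); // close the previous session
                start = ts;
                count = 0;
            }
            last = ts;
            count++;
        }
        sessions.add(new Session(start, last, count)); // close the final session
        return sessions;
    }
}
```

On an unbounded stream the final session can never be closed this eagerly, because a later event may still extend it.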

  57. @gamussa @confluentinc @thephillyjug
    Fatality

    View Slide

  58. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Is very common in practice, not a rare
    corner case
    ✗ Related to time model discussion

    View Slide

  59. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data

    View Slide

  60. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter airplane, lose Internet connectivity

    View Slide

  61. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter airplane, lose Internet connectivity
    Emails are being written during the 10h flight

    View Slide

  62. @gamussa @confluentinc @thephillyjug
    Out-of-order and late data
    Users with mobile phones enter airplane, lose Internet connectivity
    Emails are being written during the 10h flight
    Internet connectivity is restored, phones will send queued emails now

    View Slide

  63. @gamussa @confluentinc @thephillyjug
    Stream Processing: results

    View Slide

  64. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time

    View Slide

  65. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time
    • Windows – finite view of infinite data
    • Based on temporal characteristics of the event

    View Slide

  66. @gamussa @confluentinc @thephillyjug
    Stream Processing: results
    • Yes, it’s possible to get computation
    results in real time
    • Windows – finite view of infinite data
    • Based on temporal characteristics of the event
    • Late event processing
    • You choose how long to wait

    View Slide
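
"You choose how long to wait" can be made concrete with a grace period: a window keeps accepting out-of-order events until the highest event time seen so far has passed the window end plus the grace period. The sketch below is plain Java under stated assumptions and does not reproduce the API of Flink, Kafka Streams, or Hazelcast Jet. `LateEventCounter`, `graceMillis`, and the drop-when-late policy are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names for illustration; not taken from the talk.
final class LateEventCounter {
    private final long windowSizeMillis;
    private final long graceMillis; // how long to wait for stragglers
    private long streamTimeMillis = Long.MIN_VALUE; // highest event time seen so far
    private final Map<Long, Long> countsByWindowStart = new TreeMap<>();

    LateEventCounter(long windowSizeMillis, long graceMillis) {
        this.windowSizeMillis = windowSizeMillis;
        this.graceMillis = graceMillis;
    }

    // Returns true if the event was counted, false if it arrived too late.
    boolean accept(long eventTimeMillis) {
        streamTimeMillis = Math.max(streamTimeMillis, eventTimeMillis);
        long windowStart = eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
        long windowCloses = windowStart + windowSizeMillis + graceMillis;
        if (streamTimeMillis >= windowCloses) {
            return false; // window already closed: drop the event
        }
        countsByWindowStart.merge(windowStart, 1L, Long::sum);
        return true;
    }

    long countFor(long windowStart) {
        return countsByWindowStart.getOrDefault(windowStart, 0L);
    }
}
```

A longer grace period catches more late events but delays the point at which a window's result can be considered final, so the choice is a trade-off between completeness and latency.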

  67. @gamussa @confluentinc @thephillyjug
    https://github.com/confluentinc/kafka-streams-examples

    View Slide

  68. @gamussa @confluentinc @thephillyjug
    Thanks!
    questions?
    @gamussa
    [email protected]

    View Slide