[DataSciCon] Divide, Distribute and Conquer: Stream v. Batch

Data is flowing everywhere around us: from phones, credit cards, sensor-equipped buildings, vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and video, and so on...

http://www.datascicon.tech

Viktor Gamov

November 30, 2017

Transcript

  1. DIVIDE, DISTRIBUTE AND CONQUER:
    STREAM V. BATCH

  2. Stream v. Batch

  3. Who am I?

  4. Solutions Architect
    Who am I?

  5. Solutions Architect
    Developer Advocate
    Who am I?

  6. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Who am I?

  7. Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Hey you, yes, you, go follow me on Twitter
    Who am I?

  8. @gamussa @confluentinc @DataSciCon
    BATCH PROCESSING
    Data at rest

  9. Data and Queries
    Origin and processing

  10.

  11. Data…

  12. Data…

  13. Data…
    ✓ … inherently immutable
    ✓ … time-based

  14. CRUD -> CR
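
The "CRUD -> CR" idea on slide 14 can be illustrated with an append-only log: records are only created and read, and an "update" or "delete" is just another immutable, time-stamped fact. The sketch below is illustrative only; the `Event` type, the `current_state` helper, and the sample data are hypothetical names introduced here, not something shown in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)  # frozen: a record cannot be modified once written
class Event:
    timestamp: int         # event time, e.g. seconds since some epoch
    key: str
    value: Optional[str]   # None marks a deletion (a "tombstone")


def current_state(log: List[Event]) -> Dict[str, str]:
    """Derive the latest value per key by reading the whole log in time order."""
    state: Dict[str, str] = {}
    for event in sorted(log, key=lambda e: e.timestamp):
        if event.value is None:
            state.pop(event.key, None)
        else:
            state[event.key] = event.value
    return state


# The log only ever grows: Create and Read, no in-place Update or Delete.
log = [
    Event(1, "alice", "Atlanta"),
    Event(2, "bob", "Boston"),
    Event(3, "alice", "Austin"),  # an "update" is a new fact, not an overwrite
    Event(4, "bob", None),        # a "delete" is also a new fact
]
print(current_state(log))  # {'alice': 'Austin'}
```

Because nothing is overwritten, the full history stays available and the current view can always be recomputed from it.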

  15. Processing is a query

  16. Processing is a query
    Function on full data set

  17. Processing is a query
    Function on full data set
    Projection

  18. Processing is a query
    Function on full data set
    Projection
    Aggregations

  19. Processing is a query
    Function on full data set
    Projection
    Aggregations
    Joins
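
A minimal sketch of "processing is a query" in the batch sense: one function applied to the complete data set, built from a projection, an aggregation, and a join. The flight records, carrier names, and function names are made up for illustration and are not taken from the talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical data at rest: the whole, bounded data set is available up front.
flights = [
    {"flight": "DL1", "carrier": "DL", "delay_min": 10},
    {"flight": "DL2", "carrier": "DL", "delay_min": 30},
    {"flight": "UA7", "carrier": "UA", "delay_min": 5},
]
carriers = {"DL": "Delta", "UA": "United"}


def project(rows: List[dict]) -> List[Tuple[str, int]]:
    """Projection: keep only the columns the query needs."""
    return [(r["carrier"], r["delay_min"]) for r in rows]


def aggregate(pairs: List[Tuple[str, int]]) -> Dict[str, float]:
    """Aggregation: average delay per carrier code."""
    totals = defaultdict(lambda: [0, 0])  # code -> [sum of delays, count]
    for code, delay in pairs:
        totals[code][0] += delay
        totals[code][1] += 1
    return {code: s / n for code, (s, n) in totals.items()}


def join(averages: Dict[str, float], names: Dict[str, str]) -> Dict[str, float]:
    """Join: enrich each carrier code with its name from a second data set."""
    return {names[code]: avg for code, avg in averages.items()}


def batch_query(rows: List[dict], names: Dict[str, str]) -> Dict[str, float]:
    """The query is a function over the full data set."""
    return join(aggregate(project(rows)), names)


print(batch_query(flights, carriers))  # {'Delta': 20.0, 'United': 5.0}
```

Rerunning `batch_query` after new data arrives recomputes the answer from scratch, which is the defining trait of the batch approach.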

  20. Lambda architecture origins
    http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

  21.

  22.

  23. https://mapr.com/developercentral/lambda-architecture/
    Lambda Architecture

  24.

  25. TFW Trying to explain modern big data landscape

  26.

  27. STREAM PROCESSING
    Data in motion

  28. Streaming Platform

  29. Streaming Platform

  30.

  31. Interesting cases
    Before You Go

  32. I FOUND YOUR LACK OF FAULT TOLERANCE DISTURBING

  33. Data is too important to store on a single computer

  34.

  35.

  36.

  37.

  38. How to process «infinite» data?

  39. Time model

  40. Time model
    Different use cases require different time semantics

  41. Time model
    Different use cases require different time semantics
    The majority of use cases require event-time semantics

  42. Time model
    Different use cases require different time semantics
    The majority of use cases require event-time semantics
    Other use cases may require processing-time or special variants like ingestion-time
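
To make the three time semantics concrete, the sketch below assigns one record to a 5-minute window under each of them. The `Record` type, the `timestamp_for` helper, and all timestamps are hypothetical values chosen for illustration; a real stream processor configures this differently.

```python
from dataclasses import dataclass

WINDOW_SIZE = 300  # 5-minute windows, in seconds


@dataclass(frozen=True)
class Record:
    payload: str
    event_time: int      # when the event happened at the source
    ingestion_time: int  # when the platform received and stored it


def timestamp_for(record: Record, semantics: str, now: int) -> int:
    """Pick the timestamp that the chosen time semantics assigns to a record."""
    if semantics == "event-time":
        return record.event_time
    if semantics == "ingestion-time":
        return record.ingestion_time
    if semantics == "processing-time":
        return now  # wall-clock time at the moment of processing
    raise ValueError(f"unknown semantics: {semantics}")


def window_start(ts: int) -> int:
    """Start of the fixed-size window that contains the timestamp."""
    return ts - ts % WINDOW_SIZE


# Written at t=100, stored at t=700, processed at t=905.
email = Record("hello", event_time=100, ingestion_time=700)
for semantics in ("event-time", "ingestion-time", "processing-time"):
    ts = timestamp_for(email, semantics, now=905)
    print(semantics, "-> window starting at", window_start(ts))
```

The same record lands in three different windows (starting at 0, 600, and 900), so the choice of semantics directly changes the results.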

  43. Time Model

  44. Time Model

  45. Time Model

  46. Windowing
    [Figure: windowing of input data plotted by event-time and processing-time; colors represent different users' events (alice, bob, dave), rectangles denote different event-time windows]

  47. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

  48. Windowing
    Windowing is an operation that groups events
    Most commonly needed: time windows, session windows
    Examples:
    • Real-time monitoring: 5-minute averages
    • Reader behavior on a website: user browsing sessions
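
The two window types named on slide 48 can be sketched in a few lines: fixed-size (tumbling) time windows for the 5-minute averages, and session windows that split a user's events wherever there is a long enough pause. Function names, the sample data, and the rule that a pause longer than `gap` starts a new session are assumptions of this sketch, not an API from the talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_averages(readings: List[Tuple[int, float]], size: int) -> Dict[int, float]:
    """Average value per fixed-size, non-overlapping event-time window."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % size].append(value)  # key = window start
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def session_windows(timestamps: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group one user's event times into sessions separated by more than `gap`."""
    sessions: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the current session
        else:
            sessions.append((ts, ts))             # start a new session
    return sessions


# Real-time monitoring: 5-minute (300 s) averages of (event_time, value) readings.
print(tumbling_averages([(0, 10.0), (120, 20.0), (310, 40.0)], size=300))
# {0: 15.0, 300: 40.0}

# Reader behavior: browsing sessions with a 10-minute (600 s) inactivity gap.
print(session_windows([0, 60, 100, 2000, 2030], gap=600))
# [(0, 100), (2000, 2030)]
```

Time windows have boundaries fixed by the clock, while session boundaries depend on the data itself.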

  49. Out-of-order and late data
    Is very common in practice, not a rare corner case
    • Related to time model discussion

  50. Out-of-order and late data

  51. Out-of-order and late data
    Users with mobile phones board an airplane and lose Internet connectivity

  52. Out-of-order and late data
    Users with mobile phones board an airplane and lose Internet connectivity
    Emails are being written during the 10h flight

  53. Out-of-order and late data
    Users with mobile phones board an airplane and lose Internet connectivity
    Emails are being written during the 10h flight
    Internet connectivity is restored; phones now send the queued emails

  54. Stream Processing: results

  55. Stream Processing: results
    • Yes, it’s possible to get computation results in real time

  56. Stream Processing: results
    • Yes, it’s possible to get computation results in real time
    • Windows – finite view of infinite data
      • Based on temporal characteristics of the event

  57. Stream Processing: results
    • Yes, it’s possible to get computation results in real time
    • Windows – finite view of infinite data
      • Based on temporal characteristics of the event
    • Late event processing
      • You choose how long to wait
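
"You choose how long to wait" can be sketched as a grace period on event-time windows: a window stays open until the highest event time seen so far passes the window end plus the grace period, and anything arriving after that is dropped. This is a simplified model with hypothetical names (`windowed_counts`, `grace`); it is not the implementation of any particular stream processor.

```python
from typing import Dict, Iterable, List, Tuple


def windowed_counts(
    event_times: Iterable[int], size: int, grace: int
) -> Tuple[Dict[int, int], List[int]]:
    """Count events per event-time window, dropping events that arrive too late."""
    counts: Dict[int, int] = {}
    dropped: List[int] = []
    stream_time = 0  # highest event time observed so far
    for ts in event_times:  # iteration order = arrival order, not event-time order
        stream_time = max(stream_time, ts)
        start = ts - ts % size
        if stream_time >= start + size + grace:
            dropped.append(ts)  # the window for this event is already closed
        else:
            counts[start] = counts.get(start, 0) + 1
    return counts, dropped


# Event times in arrival order; 50 and 20 arrive out of order.
arrivals = [10, 70, 50, 130, 20]

print(windowed_counts(arrivals, size=60, grace=0))
# ({0: 1, 60: 1, 120: 1}, [50, 20])
print(windowed_counts(arrivals, size=60, grace=30))
# ({0: 2, 60: 1, 120: 1}, [20])
```

A longer grace period captures more late events at the cost of keeping windows open, and their results provisional, for longer.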

  58. DEMO
    Let’s analyze flights

  59. https://www.confluent.io/blog/predicting-flight-arrivals-with-the-apache-kafka-streams-api/

  60. Example: Training Flight Prediction Model

  61. https://github.com/confluentinc/online-inferencing-blog-application

  62. Thanks!
    Questions?
    @gamussa
    [email protected]