[Philly JUG] Divide, Distribute and Conquer: Stream v. Batch

Viktor Gamov
September 13, 2017

Data is flowing everywhere around us: from phones, credit cards, sensor-equipped buildings, vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and video, and so on. Simple data collection is not enough anymore. Most current systems process data via nightly extract, transform, and load (ETL) operations. This approach, common in enterprise environments, requires decision makers to wait an entire day (or night) for reports to become available.

But businesses don’t want «Big Data» anymore. They want «Fast Data». What distinguishes «streaming systems» from batch systems is that the event stream is unbounded, or “infinite”, from a system perspective.

Decision-makers need to analyze these streaming events as a whole to make business decisions as new information arrives. In this talk, after a short introduction to common approaches and architectures (lambda, kappa), Viktor will demonstrate how to use open-source stream processing tools (Flink, Kafka Streams, Hazelcast Jet).
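
To make the batch versus stream distinction above concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use Flink, Kafka Streams, or Hazelcast Jet; the function names `batch_count` and `streaming_count` and the sample votes are assumptions made up for illustration. The batch function needs the whole bounded data set before it can answer, while the streaming function emits an updated answer after every event and never requires the input to end.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: list[str]) -> Counter:
    """Batch: the full, bounded data set exists before the query runs."""
    return Counter(events)


def streaming_count(events: Iterable[str]) -> Iterator[dict[str, int]]:
    """Streaming: yield an updated result per event; input may be unbounded."""
    counts: Counter = Counter()
    for vote in events:
        counts[vote] += 1
        yield dict(counts)  # snapshot of the running aggregate


# Hypothetical sample data, for illustration only.
votes = ["up", "down", "up"]
batch_result = batch_count(votes)
stream_results = list(streaming_count(iter(votes)))
```

On a finite input the last streaming snapshot matches the batch result; the difference is that intermediate results are available as each event arrives.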

Transcript

  1. @gamussa @confluentinc @thephillyjug Divide, Distribute and Conquer: Stream v. Batch

  2. Stream v. Batch

  3. Who am I?

  4. Solutions Architect Who am I?

  5. Solutions Architect Developer Advocate Who am I?

  6. Solutions Architect Developer Advocate @gamussa in internetz Who am I?

  7. Solutions Architect Developer Advocate @gamussa in internetz Hey you, yes, you, go follow me on Twitter Who am I?

  8. Disclaimer:

  9. BATCH PROCESSING Data at rest

  10. Data and Queries Origin and processing

  11. None

  12. Data…

  13. Data…

  14. Data… ✓ … inherently immutable ✓ … time-based

  15. CRUD -> CR

  16. Processing is a query

  17. Processing is a query Function on full data set

  18. Processing is a query Function on full data set Projection

  19. Processing is a query Function on full data set Projection Aggregations

  20. Processing is a query Function on full data set Projection Aggregations Joins

  21. SELECT user_vote, count(*) FROM AccessLog WHERE event_date BETWEEN "04/07/2017" AND "04/07/2017" GROUP BY user_vote;

  22. SELECT user_vote, count(*) FROM AccessLog WHERE event_date BETWEEN "04/07/2017" AND "04/08/2017" GROUP BY user_vote;

  23. SELECT user_vote, count(*) FROM AccessLog WHERE event_date BETWEEN "04/07/2017" AND "04/08/2007" GROUP BY user_vote;

  24. Lambda architecture origins http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

  25. None

  26. Lambda Architecture

  27. None

  28. TFW Trying to explain modern big data landscape

  29. Precomputed Results http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

  30. Batch Process http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

  31. STREAM PROCESSING Data in motion

  32. Streaming Platform

  33. Streaming Platform

  34. Directed Acyclic Graph

  35. DEMO

  36. DEMO

  37. Interesting cases Before You Go

  38. I FOUND YOUR LACK OF FAULT TOLERANCE DISTURBING

  39. Data is too important to store it in one computer

  40. None
  41. None
  42. None
  43. None
  44. How to process «infinite» data?

  45. Time model

  46. Time model Different use cases time semantics

  47. Time model Different use cases time semantics Majority of use cases require event-time semantics

  48. Time model Different use cases time semantics Majority of use cases require event-time semantics Other use cases may require processing-time or special variants like ingestion-time

  49. Time Model

  50. Time Model

  51. Time Model

  52. Finite Representation Of Infinite Data

  53. Windowing Windowing is an operation that groups events

  54. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

  55. Windowing Input data, where colors represent different users' events. Rectangles denote different event-time windows. (Figure labels: processing-time event-time windowing alice bob dave)

  56. Windowing Windowing is an operation that groups events Most commonly needed: time windows, session windows Examples: ✗Real-time monitoring: 5-minute averages ✗Reader behavior on a website: user browsing sessions

  57. Fatality

  58. Out-of-order and late data Is very common in practice, not a rare corner case ✗Related to time model discussion

  59. Out-of-order and late data

  60. Out-of-order and late data Users with mobile phones enter airplane, lose Internet connectivity

  61. Out-of-order and late data Users with mobile phones enter airplane, lose Internet connectivity Emails are being written during the 10h flight

  62. Out-of-order and late data Users with mobile phones enter airplane, lose Internet connectivity Emails are being written during the 10h flight Internet connectivity is restored, phones will send queued emails now

  63. Stream Processing: results

  64. Stream Processing: results • Yes, it’s possible to get computation results in real time

  65. Stream Processing: results • Yes, it’s possible to get computation results in real time • Windows – finite view of infinite data • Based on temporal characteristics of the event

  66. Stream Processing: results • Yes, it’s possible to get computation results in real time • Windows – finite view of infinite data • Based on temporal characteristics of the event • Late event processing • You choose how long to wait

  67. https://github.com/confluentinc/kafka-streams-examples

  68. Thanks! Questions? @gamussa viktor@confluent.io
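
The windowing and late-data slides (53 to 66) can be illustrated with a small sketch in plain Python. This is not code from the talk and not the Kafka Streams or Flink API. It assumes 5-minute tumbling windows keyed by event time (as in the "5-minute averages" example), a crude watermark equal to the highest event time seen so far, and a hypothetical `ALLOWED_LATENESS` setting standing in for "you choose how long to wait". The user names come from slide 55; the timestamps are invented.

```python
from collections import defaultdict

WINDOW_SIZE = 300        # 5-minute tumbling windows, in seconds
ALLOWED_LATENESS = 600   # hypothetical: how long to wait for late events


def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def count_per_window(events):
    """Count events per (user, window) using event time, not arrival order.

    events: iterable of (user, event_time) tuples in arrival order.
    Returns (counts, dropped), where dropped holds events that arrived
    after their window's allowed lateness had expired.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0  # crude watermark: highest event time seen so far
    for user, event_time in events:
        max_event_time = max(max_event_time, event_time)
        start = window_start(event_time)
        window_end = start + WINDOW_SIZE
        if max_event_time - window_end > ALLOWED_LATENESS:
            dropped.append((user, event_time))  # too late, window is closed
            continue
        counts[(user, start)] += 1
    return dict(counts), dropped


# Invented arrival sequence; event times are seconds since some origin.
arrivals = [
    ("alice", 10),
    ("bob", 20),
    ("alice", 320),
    ("bob", 250),    # out of order, but still within allowed lateness
    ("dave", 1500),
    ("alice", 100),  # arrives after its window's lateness budget expired
]
counts, dropped = count_per_window(arrivals)
```

Here bob's out-of-order event is still counted in the window it belongs to, while alice's very late event is dropped. Raising `ALLOWED_LATENESS` trades more retained state for more complete results.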