Streaming analytics better than batch – when and why by Dawid Wysakowicz and Adam Kawa at Big Data Spain 2017

While many problems can be solved in batch, the stream processing approach currently offers more benefits. It is not only about sub-second latency at scale, but mainly about the ability to express accurate analytics with little effort, something that is hard to do or usually ignored with older batch technologies like Pig, Scalding, or Spark, and even with established stream processors like Storm or Spark Streaming.

https://www.bigdataspain.org/2017/talk/streaming-analytics-better-than-batch-when-and-why

Big Data Spain 2017
16th - 17th November Kinépolis Madrid

Big Data Spain

November 22, 2017
Transcript

  1. None
  2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Streaming analytics better than batch - when and why? Adam Kawa, Dawid Wysakowicz, Krzysztof Zarzycki
  3. Have you ever built cool Big Data pipelines?
  4. None
  5. Example Use-Case ▪ Can be done in batch and real-time ▪ User session analytics at Spotify • Simple stats ▪ Duration, number of songs, skips, searches etc. • Advanced analytics ▪ Mood, physical activity, real-time content, ads
  6. Example Output How long do users listen to a new edition of Discover Weekly? _1. Dashboards_
  7. Example Output How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short !!! _1. Dashboards_ _2. Alerts_
  8. Example Output How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short !!! Recommend songs and ads based on current activity. _1. Dashboards_ _2. Alerts_ _3. Content_
  9. 1st - Batch Architecture [Diagram labels: User Events, User Sessions; intervals 1h, 1h, 1h, 1h - 1d, 1h]
  10. 1st - Batch Architecture [Diagram labels: User Events, User Sessions; intervals 1h, 1h, 1h, 1d, 1h]
  11. The More Moving Parts … ⬇ The higher learning curve ⬇ The more gluing code ⬇ The larger administrative effort ⬇ The more error-prone solution
  12. Long Waiting Time Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
  13. 2nd - Micro-Batch Architecture 1m - 1h
  14. No Built-In Session Windows [Figure: a stream of events (♪) cut by the fixed windows [10:00 - 11:00) and [11:00 - 12:00)]
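
The problem on the slide above can be illustrated with a minimal sketch in plain Scala (no Flink or Spark dependency). It is not taken from the talk: `Event`, `hourBucket` and `countsPerHour` are hypothetical names, and event times are simplified to minutes since midnight. It shows how fixed hourly windows cut one listening session that crosses 11:00 into two partial results.

```scala
// Illustrative only: plain Scala, no Flink or Spark dependency.
// Event, hourBucket and countsPerHour are hypothetical names, not from the talk.
object HourlyBuckets {
  // A playback event: user id plus event time as minutes since midnight.
  final case class Event(userId: String, minuteOfDay: Int)

  // Index of the fixed hourly window [h:00, (h+1):00) an event falls into.
  def hourBucket(e: Event): Int = e.minuteOfDay / 60

  // Count events per hourly window, the way an hourly batch job would see them.
  def countsPerHour(events: Seq[Event]): Map[Int, Int] =
    events.groupBy(hourBucket).map { case (hour, es) => hour -> es.size }

  def main(args: Array[String]): Unit = {
    // One uninterrupted listening session from 10:50 to 11:10.
    val session = Seq(650, 655, 660, 665, 670).map(m => Event("user-1", m))
    // The hourly windows report two partial sessions instead of one.
    println(countsPerHour(session))
  }
}
```

Stitching the two halves back together is the kind of gluing code that a batch or micro-batch job without session windows has to carry itself.
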
  16. Late Data … [Figure: events (♪) plotted by event time against processing time; label: 14:55 - 16:35]
  17. ... Included in Current Batch [Figure: events (♪) plotted by event time against processing time; labels: 14:55 - 16:35, 16:50 - …]
  18. Out-Of-Order Data … [Figure: three events (♪ ♫ ♪) plotted by event time against processing time]
  19. Out-Of-Order Data … [Figure: nine events (♪ ♫) plotted by event time against processing time]
  21. ... Breaks Correctness [Figure: two sequences of events (♪ ♫ ♬) plotted by event time against processing time]
  22. Problems FILES, BATCHES, DATA LAKES
  23. Solving Streaming Problem With Batch?
  24. 3rd - Streaming-First Architecture
  25. User Session Windows [Figure: events (♪) for Case A and Case B; session gap e.g. 15 minutes; label: 5]
  26. User Session Windows [Figure: events (♪) for Case A and Case B; session gap e.g. 15 minutes; labels: 5, [3,2]]
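
The gap rule on the two slides above can be sketched in plain Scala as follows. This is a hedged illustration, not the authors' code: `GapSessionizer` and `sessionize` are hypothetical names, timestamps are plain minutes, and the inputs in `main` are made up so that the session sizes match the 5 and [3,2] labels, on the assumption that those labels are per-session event counts.

```scala
// Illustrative only: a gap-based sessionizer in plain Scala (hypothetical names).
object GapSessionizer {
  // Split event timestamps (in minutes) into sessions. An event joins the open
  // session when it is at most gapMinutes after the previous event; otherwise
  // it opens a new session.
  def sessionize(timestamps: Seq[Long], gapMinutes: Long): List[List[Long]] = {
    // Sorting by event time makes the result independent of arrival order.
    val sorted = timestamps.sorted
    // Sessions are built newest-first, and events inside a session newest-first.
    val newestFirst = sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, t) if t - current.head <= gapMinutes =>
        (t :: current) :: done
      case (sessions, t) =>
        List(t) :: sessions
    }
    // Restore chronological order for sessions and for events within them.
    newestFirst.map(_.reverse).reverse
  }

  def main(args: Array[String]): Unit = {
    val caseA = Seq(0L, 5L, 10L, 15L, 20L) // no pause longer than the gap
    val caseB = Seq(0L, 5L, 10L, 40L, 45L) // 30-minute pause after minute 10
    println(sessionize(caseA, 15L).map(_.size))
    println(sessionize(caseB, 15L).map(_.size))
  }
}
```

Because the function sorts by event time first, out-of-order arrival does not change the sessions it produces, which is the property the event-time session windows on the following slides provide inside the stream processor.
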
  27. Reading From Kafka

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))

      [Figure: a stream of events (♪)]
  28. Session Windows With Gap

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)

      [Figure: events (♪) for User 1 and User 2]
  29. Session Windows With Gap

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)
            .window(EventTimeSessionWindows.withGap(Time.minutes(15)))

      [Figure: User 1 events (♪) with a session gap of 15 minutes]
  30. Analyzing User Session

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)
            .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
            .apply(new CountSessionStats())

      [Figure: User 1 events (♪)]
  31. Handling Late Events

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)
            .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
            .allowedLateness(Time.minutes(60))
            .apply(new CountSessionStats())

      [Figure: User 1 events (♪)]
  32. Triggering Early Results

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)
            .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
            .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
            .allowedLateness(Time.minutes(60))
            .apply(new CountSessionStats())

      [Figure: User 1 events (♪)]
  33. Sessionization Example

          val sessionStream: DataStream[SessionStats] = sEnv
            .addSource(new KafkaConsumer(...))
            .keyBy(_.userId)
            .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
            .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
            .allowedLateness(Time.minutes(60))
            .apply(new CountSessionStats())

      Working example: https://github.com/getindata/flink-use-case
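
The idea behind `.allowedLateness(Time.minutes(60))` in the pipeline above can be sketched without Flink. The following is a simplified model under stated assumptions, not Flink's actual bookkeeping: `LatenessCheck`, `Outcome` and `classify` are hypothetical names, all times are plain minutes, and the watermark is taken as given.

```scala
// Illustrative only: a simplified model of how a window treats an arriving event,
// depending on how far the watermark has advanced. All names are hypothetical.
object LatenessCheck {
  sealed trait Outcome
  case object OnTime extends Outcome          // window has not produced its result yet
  case object LateButAccepted extends Outcome // result already emitted, gets updated
  case object Dropped extends Outcome         // window state already discarded

  // windowEnd, watermark and allowedLateness share one unit (minutes here).
  def classify(windowEnd: Long, watermark: Long, allowedLateness: Long): Outcome =
    if (watermark < windowEnd) OnTime
    else if (watermark < windowEnd + allowedLateness) LateButAccepted
    else Dropped

  def main(args: Array[String]): Unit = {
    val windowEnd = 100L
    // Watermark positions before the window end, within the 60 minutes of
    // allowed lateness, and past it.
    Seq(90L, 130L, 160L).foreach { wm =>
      println(s"watermark=$wm -> ${classify(windowEnd, wm, 60L)}")
    }
  }
}
```

The middle case is what distinguishes this approach from the micro-batch behaviour shown earlier: a late event updates the session it belongs to instead of being counted in whichever batch happens to be running.
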
  34. Modern Stream Processing Engines ▪ Rich stream processing semantics • Built-in support for event-time windows • Accurate results for late / out-of-order events and replays • Early triggers ▪ Low latency and high throughput ▪ Exactly-once stateful processing
  35. Modern Stream Processing Engines ▪ Rich stream processing semantics • Built-in support for event-time windows • Accurate results for late / out-of-order events and replays • Early triggers ▪ Low latency and high throughput ▪ Exactly-once stateful processing User survey: http://data-artisans.com/flink-user-survey-2016-part-1 http://data-artisans.com/flink-user-survey-2016-part-2
  36. None
  37. How can I reprocess data?
  38. Reprocessing Events In Flink 1. Take periodic snapshots of a job • A snapshot stores Kafka offsets, in-flight sessions, application state 2. Restart a job from a savepoint rather than from the beginning
  39. What if data is no longer in Kafka?
  40. Consuming Data From HDFS ▪ Run your streaming code on HDFS (bounded data) • You need to read data in event-time based order • Implement a mechanism for proper watermark generation
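
One common way to generate watermarks is to trail the highest event timestamp seen so far by a fixed bound. The sketch below shows that idea in plain Scala. It is an assumption about what "proper watermark generation" could look like, not the mechanism used in the talk, and `BoundedWatermarks`, `Generator` and `watermarksFor` are hypothetical names.

```scala
// Illustrative only: bounded-out-of-orderness watermarks in plain Scala.
// All names are hypothetical; this is not Flink's API.
object BoundedWatermarks {
  // Tracks the highest event timestamp seen and reports a watermark that
  // trails it by maxOutOfOrderness, so the watermark never moves backwards.
  final class Generator(maxOutOfOrderness: Long) {
    // Start low enough that the subtraction below cannot underflow.
    private var maxSeen: Long = Long.MinValue + maxOutOfOrderness

    def onEvent(timestamp: Long): Unit = {
      maxSeen = math.max(maxSeen, timestamp)
    }

    def currentWatermark: Long = maxSeen - maxOutOfOrderness
  }

  // Feed the timestamps in arrival order and record the watermark after each one.
  def watermarksFor(timestamps: List[Long], bound: Long): List[Long] = {
    val generator = new Generator(bound)
    timestamps.map { t =>
      generator.onEvent(t)
      generator.currentWatermark
    }
  }

  def main(args: Array[String]): Unit = {
    // Slightly out-of-order input: 11 arrives after 12, and 13 after 15.
    println(watermarksFor(List(10L, 12L, 11L, 15L, 13L), 2L))
  }
}
```

When files are read back from HDFS, the bound has to cover the disorder introduced by how the files are read, which is why the slide asks for data to be read in event-time based order.
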
  41. What are usual stream processing applications?
  42. Stream Analytics Image source: https://www.slideshare.net/sinisalyh/storm-at-spotify
  43. Stream 24/7 Applications
  44. When is batch processing good?
  45. Batch Processing Use-Cases ▪ Ad-hoc analytics and data exploration • Notebooks, Spark/Flink/Hive, Parquet, complete data sets ▪ Technical advantages • Large swaths of historical data in HDFS • High-level libraries in mature batch technologies
  46. Batch Processing Use-Cases ▪ Ad-hoc analytics and data exploration • Notebooks, Spark/Flink/Hive, Parquet, complete data sets ▪ Implementation advantages • Offline experiments over large historical data ▪ Historical events are usually stored in HDFS, not Kafka • High-level libraries in batch processing technologies ▪ Spark MLlib, H2O ▪ (When data arrives continuously) don’t solve a streaming problem with batch jobs
  47. Who Are You, actually? ▪ At GetInData, we build custom Big Data solutions • Hadoop, Flink, Spark, Kafka and more ▪ Our team is today represented by Krzysztof Zarzycki, Dawid Wysakowicz, Adam Kawa
  48. Summary ▪ A stream is often the natural representation of your data ▪ Stream processing is not only about low latency
  49. Q&A
  50. Thanks!
  51. None
  52. Log Abstraction [Diagram labels: 10:00 - 11:00, 11:00 - 12:00, 12:00 - 13:00, …; 10:00 - …, 10:00 - …]
  53. Spark Structured Streaming ⬇ Operates on top of micro-batches (Spark SQL engine) ▪ The ALPHA version and the experimental API until July 11, 2017 ⬆ Easy-to-learn API (Dataset/DataFrame) ⬆ Rich ecosystem of tools and libraries e.g. MLlib ⬆ Supports event-time ⬇ Sessionization not yet supported - SPARK-10816 ⬇ Queryable state not yet supported - SPARK-16738
  54. Kafka Streams ⬇ No exactly-once (just at-least-once) ⬇ Kafka as the only data source ⬇ No bounded streams (batch) optimizations ⬆ Simplicity ⬆ Embedded into application ⬆ Supports event-time ⬇ Lack of session windows
  55. Apache Beam ⬆ Unified API for batch and streaming ⬆ Rich streaming processing semantics ⬆ Complex TriggerDSL ⬆ Multiple runtime environments ⬆ Spark, Flink, Apex, Dataflow ⬆ Side inputs and outputs ⬇ Verbose Java API ⬇ New project - Top level since 01/2017
  56. Google Dataflow ▪ Runtime environment for Apache Beam in Google Cloud ⬇ No support for Iterative Computations ⬆ Supports Side Outputs ⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)
  57. How to join with other data sets/streams?
  58. Join With Other Datasets / Streams ▪ Flink can join windowed streams easily ▪ Join of data stream with data set is WIP • Even with slowly changing data set! • Even keyed data [Diagrams: Stream 1 + Stream 2 → Joined Stream; Input Stream + Dataset (Id, Name: 1 John Doe; 2 Jane Doe) → Joined Stream]
  59. I like this streaming API. Can I use it for batch?
  60. Unified batch and streaming API ▪ Not with raw Flink API ▪ But with Flink Table API ▪ Apache Beam
  61. Is Flink production ready?
  62. Powered By Flink