Apache Flink: data streaming as a basis for all analytics by Kostas Tzoumas at Big Data Spain 2015

Flink is one of the largest and most active Apache big data projects with well over 120 contributors

Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-31.html

Big Data Spain

October 22, 2015

Transcript

  1. None
  2. Kostas Tzoumas @kostas_tzoumas #BDS15 Apache Flink™: Stream processing as a basis for all analytics
  3. (1) A bit of history (2) The streaming era and Flink (3) Inside Flink 0.10 (4) Towards Flink 1.0 and beyond
  4. A bit of history: from incubation until now

  5. [Timeline figure. Date labels: Apr 2014, Dec 2014, Jun 2015, Oct 2015. Milestone labels: releases 0.5, 0.6, 0.7, 0.8, 0.9-m1, 0.9, 0.10 and "Top level"]
  6. Community growth: Flink is one of the largest and most active Apache big data projects, with well over 120 contributors
  7. Speakers at Flink Forward 2015

  8. Flink meetups around the globe

  9. Featured in

  10. Welcome to the streaming era

  11. Streaming is the biggest change in data infrastructure since Hadoop
  12. (1) Radically simplified infrastructure (2) Internet of Things, on-demand services (3) Can completely subsume batch
  13. In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure

    [Diagram labels: apps, each with a local view; consistent movement, analytics; global view, consistent store]
  14. Until now, stream processors were less mature than batch processors

    This led to
    • in-house solutions
    • abuse of batch processors
    • Lambda architectures

    This is no longer the case
  15. Flink 0.10: With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems. And we are heading to Flink 1.0 after that.
  16. Streaming technology has matured
    • e.g., Flink, Kafka, Dataflow

    Flink and Dataflow duality
    • a Google technology
    • an open source Apache project
    • compatible via Flink runner +
  17. Flink 0.10: Flink for the streaming era

  18. The new DataStream API

    The difference between programming for batch and streaming is dealing with time and state

    Flink 0.10 comes with a new DataStream API
    • Same as batch if you do not care about time
    • Smooth transition from batch for basic time manipulation
    • Powerful tools for dealing with time and state when you need them
  19. The new DataStream API

    case class Event(location: Location, numVehicles: Long)

    val stream: DataStream[Event] = …;

    stream
      .filter { evt => isIntersection(evt.location) }
  20. The new DataStream API

    case class Event(location: Location, numVehicles: Long)

    val stream: DataStream[Event] = …;

    stream
      .filter { evt => isIntersection(evt.location) }
      .keyBy("location")
      .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
      .sum("numVehicles")
  21. The new DataStream API

    case class Event(location: Location, numVehicles: Long)

    val stream: DataStream[Event] = …;

    stream
      .filter { evt => isIntersection(evt.location) }
      .keyBy("location")
      .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
      .trigger(new Threshold(200))
      .sum("numVehicles")
  22. The new DataStream API

    case class Event(location: Location, numVehicles: Long)

    val stream: DataStream[Event] = …;

    stream
      .filter { evt => isIntersection(evt.location) }
      .keyBy("location")
      .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
      .trigger(new Threshold(200))
      .sum("numVehicles")
      .keyBy( evt => evt.location.grid )
      .mapWithState { (evt, state: Option[Model]) => {
        // start from a fresh model when no state exists yet for this key
        val model = state.getOrElse(new Model())
        // emit the classification, keep the updated model as the new state
        (model.classify(evt), Some(model.update(evt)))
      }}
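
The `keyBy` / `timeWindow` / `sum` steps on the slides above can be illustrated without a cluster. The following is a minimal plain-Scala sketch of what a 15-minute window sliding every 5 minutes computes per key. It is not Flink code and does not reflect how Flink implements windows; the `minute` timestamp field, the string-typed `location`, and all names in `SlidingWindowSketch` are assumptions made for the illustration.

```scala
// Plain-Scala illustration of keyed sliding-window sums (not the Flink API).
object SlidingWindowSketch {
  // Like the Event class on the slide, plus an explicit timestamp in minutes (assumption).
  final case class Event(location: String, numVehicles: Long, minute: Long)

  // Start times of every window [start, start + size) that contains `minute`.
  def windowStarts(minute: Long, size: Long, slide: Long): List[Long] = {
    val latestStart = minute - (minute % slide) // latest window starting at or before `minute`
    Iterator.iterate(latestStart)(_ - slide).takeWhile(_ > minute - size).toList
  }

  // Sum numVehicles per (location, window start): each event is copied into
  // every window that overlaps its timestamp, then grouped and summed.
  def windowSums(events: Seq[Event], size: Long, slide: Long): Map[(String, Long), Long] =
    events
      .flatMap(e => windowStarts(e.minute, size, slide).map(start => ((e.location, start), e.numVehicles)))
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  // Hypothetical sample data.
  val sample: Seq[Event] = Seq(Event("A", 10L, 2L), Event("A", 5L, 7L), Event("B", 1L, 7L))
}
```

Calling `SlidingWindowSketch.windowSums(SlidingWindowSketch.sample, 15L, 5L)` assigns each event to three overlapping windows (size divided by slide), which is why a sliding window reports the same event more than once.
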
  23. IoT / Mobile Applications

    Events occur on devices. Events are stored in a log (Queue / Log). Events are analyzed in a data streaming system (Stream Analysis).
  24. IoT / Mobile Applications

  25. IoT / Mobile Applications

  26. IoT / Mobile Applications

  27. IoT / Mobile Applications: Out of order!

    [Figure labels: first burst of events, second burst of events]
  28. IoT / Mobile Applications

    Event time windows, arrival time windows, instant event-at-a-time processing

    Flink supports out-of-order (event time) windows, arrival time windows (and mixtures), plus low-latency processing.

    [Figure labels: first burst of events, second burst of events]
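
The difference between event time and arrival time windows is easiest to see with a delayed burst. The sketch below is plain Scala, not Flink code: it counts readings per tumbling window twice, once by the time the event happened on the device and once by the time it reached the processor. The `Reading` type, the window size, and the sample timestamps are all assumptions for the illustration.

```scala
// Plain-Scala illustration of event-time vs. arrival-time windowing (not the Flink API).
object EventTimeSketch {
  // eventTime: when the event happened on the device.
  // arrivalTime: when the stream processor received it.
  final case class Reading(id: String, eventTime: Long, arrivalTime: Long)

  // Start of the tumbling window of width `size` that contains `ts`.
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Count readings per tumbling window, using the supplied notion of time.
  def countPerWindow(readings: Seq[Reading], size: Long)(time: Reading => Long): Map[Long, Int] =
    readings.groupBy(r => windowStart(time(r), size)).map { case (w, rs) => w -> rs.size }

  // Hypothetical data: the first burst (event times 1 to 3) is delayed and
  // arrives after the second burst (event times 11 and 12).
  val sample: Seq[Reading] = Seq(
    Reading("b2-1", 11L, 13L), Reading("b2-2", 12L, 14L),
    Reading("b1-1", 1L, 15L), Reading("b1-2", 2L, 16L), Reading("b1-3", 3L, 17L)
  )
}
```

With a window size of 10, grouping by `_.eventTime` keeps the two bursts in separate windows, while grouping by `_.arrivalTime` merges all five readings into the window in which they happened to arrive.
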
  29. High Availability and Consistency

    No single point of failure any more: ZooKeeper ensemble, multiple masters, failover

    Exactly-once processing semantics across the pipeline

    Checkpointing/fault tolerance is decoupled from windows, which allows for highly flexible window implementations
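
Why a checkpoint yields exactly-once state updates can be sketched on a single operator. The following plain-Scala example is only a conceptual illustration under strong simplifying assumptions (one operator, a replayable in-memory log, a single simulated crash); it is not Flink's distributed snapshot mechanism, and every name in `CheckpointSketch` is hypothetical.

```scala
// Conceptual single-operator illustration of checkpoint-and-replay (not Flink's mechanism).
object CheckpointSketch {
  // A checkpoint pairs the operator state with the log offset that state reflects.
  final case class Checkpoint(offset: Int, counts: Map[String, Long])

  // Fold log records in [from, until) into the running counts.
  def process(log: Vector[String], from: Int, until: Int, counts: Map[String, Long]): Map[String, Long] =
    log.slice(from, until).foldLeft(counts) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0L) + 1L)
    }

  // The last checkpoint taken before a crash at record `crashAt`,
  // when a checkpoint is taken every `interval` records.
  def lastCheckpoint(log: Vector[String], interval: Int, crashAt: Int): Checkpoint = {
    val offset = (crashAt / interval) * interval
    Checkpoint(offset, process(log, 0, offset, Map.empty))
  }

  // Correct recovery: restore the checkpointed state AND resume from its offset.
  // Records between the checkpoint and the crash are replayed exactly once.
  def recoverAndFinish(log: Vector[String], interval: Int, crashAt: Int): Map[String, Long] = {
    val cp = lastCheckpoint(log, interval, crashAt)
    process(log, cp.offset, log.length, cp.counts)
  }

  // Incorrect recovery for contrast: state is restored but the log is replayed
  // from the beginning, so records before the checkpoint are counted twice.
  def naiveRestart(log: Vector[String], interval: Int, crashAt: Int): Map[String, Long] = {
    val cp = lastCheckpoint(log, interval, crashAt)
    process(log, 0, log.length, cp.counts)
  }

  // Hypothetical sample log.
  val sampleLog: Vector[String] = Vector("a", "b", "a", "c", "b", "a")
}
```

The key design point is that state and input position are saved together: `recoverAndFinish` produces the same counts as a failure-free run, whereas `naiveRestart` overcounts.
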
  30. Performance

    Continuous streaming, latency-bound buffering, distributed snapshots

    High throughput & low latency, with a configurable throughput/latency tradeoff
  31. Batch and Streaming: Batch Word Count in the DataStream API

    case class WordCount(word: String, count: Int)

    val text: DataStream[String] = …;

    text
      .flatMap { line => line.split(" ") }
      .map { word => new WordCount(word, 1) }
      .keyBy("word")
      .window(GlobalWindows.create())
      .trigger(new EOFTrigger())
      .sum("count")
  32. Batch and Streaming: Batch Word Count in the DataSet API

    case class WordCount(word: String, count: Int)

    DataStream API version:

    val text: DataStream[String] = …;

    text
      .flatMap { line => line.split(" ") }
      .map { word => new WordCount(word, 1) }
      .keyBy("word")
      .window(GlobalWindows.create())
      .trigger(new EOFTrigger())
      .sum("count")

    DataSet API version:

    val text: DataSet[String] = …;

    text
      .flatMap { line => line.split(" ") }
      .map { word => new WordCount(word, 1) }
      .groupBy("word")
      .sum("count")
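
The two word counts above express the idea that a batch job is a stream that ends. A minimal plain-Scala sketch of that equivalence follows; it uses only standard collections, not the DataSet or DataStream API, and the object and method names are assumptions. The streaming variant keeps running counts as state and hands out the result only once the input is exhausted, loosely analogous to a global window with an end-of-input trigger.

```scala
// Plain-Scala illustration of batch as a bounded stream (not the Flink API).
object BatchAsStreamSketch {
  // Batch style: the whole input is available, so group and count in one go.
  def batchCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Streaming style: consume one line at a time, update running state,
  // and emit the final counts only when the bounded stream ends.
  def streamingCount(lines: Iterator[String]): Map[String, Int] = {
    var counts = Map.empty[String, Int]
    for (line <- lines; word <- line.split(" ") if word.nonEmpty)
      counts = counts.updated(word, counts.getOrElse(word, 0) + 1)
    counts // emitted once, at end of input
  }
}
```

Both functions return the same map for the same input, for example `Seq("to be or", "not to be")`; the difference lies only in when the input is seen and when the result becomes visible.
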
  33. Batch and Streaming

    [Architecture diagram: DataSet (with Relational Optimizer) and DataStream (with Window Optimization) on top of a common Streaming Dataflow Runtime. Labels listed under Batch Parameters and Streaming Parameters: pipelined and blocking operators, pipelined and windowed operators, schedule lazily, schedule eagerly, recompute whole operators, periodic checkpoints, streaming data movement, stateful operations, DAG recovery, fully buffered streams, DAG resource management]
  34. Batch and Streaming: A full-fledged batch processor as well

  35. Batch and Streaming: A full-fledged batch processor as well

    See the talk at Flink Forward 2015 by Dongwon Kim: "A comparative performance evaluation of Flink"
  36. Integration (picture not complete) [Figure labels: POSIX, Java/Scala Collections, POSIX]

  37. Monitoring

    Live system metrics and user-defined accumulators/statistics

    Monitoring REST API for custom monitoring tools

    GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators

    {
      "id": "dceafe2df1f57a1206fcb907cb38ad97",
      "user-accumulators": [
        { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
        { "name": "genwords", "type": "LongCounter", "value": "75000000" }
      ]
    }
  38. Towards Flink 1.0 and beyond: Where we see the project going
  39. Towards Flink 1.0

    Flink 1.0 is around the corner
    Focus on defining public APIs and automatic API compatibility checks
    Guarantee backwards compatibility in all Flink 1.X versions
  40. Beyond Flink 1.0

    Flink engine has most features in place
    Focus on usability features on top of the DataStream API
    • e.g., SQL, ML, more connectors
    Continue work on elasticity and memory management
  41. Wrap up

  42. tl;dr

    Streaming is happening
    Better adapt now
    Flink 0.10: a modern, ready-to-use open source stream processor
  43. Read more
    • flink.apache.org/blog
    • data-artisans.com/blog

    Subscribe to the mailing lists
    Follow @ApacheFlink
    Get involved at a local meetup
  44. Appendix

  45. [Figure labels: batch, event based, need new systems, well served]