Apache Flink: data streaming as a basis for all analytics, by Kostas Tzoumas at Big Data Spain 2015

Flink is one of the largest and most active Apache big data projects with well over 120 contributors

Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-31.html

Big Data Spain

October 22, 2015

Transcript

  1. (1) A bit of history; (2) The streaming era and Flink; (3) Inside Flink 0.10; (4) Towards Flink 1.0 and beyond
  2. [Figure: release timeline. Date labels: Apr 2014, Dec 2014, Jun 2015, Oct 2015. Release labels: 0.5, 0.6, 0.7, 0.8, 0.9-m1, 0.9, 0.10. Also marked: "Top level".]
  3. Community growth

    Flink is one of the largest and most active Apache big data projects, with well over 120 contributors.
  4. In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure

    [Figure labels: App, local view; Consistent movement, analytics; App, Global view, Consistent store]
  5. Until now, stream processors were less mature than batch processors

    This led to

      • in-house solutions
      • abuse of batch processors
      • Lambda architectures

    This is no longer the case
  6. Flink 0.10

    With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems. And we are heading to Flink 1.0 after that.
  7. Streaming technology has matured

      • e.g., Flink, Kafka, Dataflow

    Flink and Dataflow duality

      • a Google technology
      • an open source Apache project
      • compatible via Flink runner
  8. The new DataStream API

    • The difference between programming for batch and streaming is dealing with time and state
    • Flink 0.10 comes with a new DataStream API
      • Same as batch if you do not care about time
      • Smooth transition from batch for basic time manipulation
      • Powerful tools for dealing with time and state when you need them
  9. The new DataStream API

        case class Event(location: Location, numVehicles: Long)

        val stream: DataStream[Event] = …;

        stream
          .filter { evt => isIntersection(evt.location) }
  10. The new DataStream API

        case class Event(location: Location, numVehicles: Long)

        val stream: DataStream[Event] = …;

        stream
          .filter { evt => isIntersection(evt.location) }
          .keyBy("location")
          .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
          .sum("numVehicles")
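
The sliding window on this slide (15 minutes long, advancing every 5 minutes) places each event in several overlapping windows. The following is a minimal plain-Scala sketch of that idea, not Flink code: the `timestampMs` field, the `String` location, the helper names, and the alignment of windows to multiples of the slide are all assumptions made for illustration, and timestamps are assumed to be non-negative.

```scala
// Plain-Scala sketch (no Flink dependency) of sliding-window assignment plus a keyed sum.
object SlidingWindowSketch {
  // Hypothetical stand-in for the slide's Event: location is a String, and the
  // event carries its own timestamp in milliseconds (an assumption of this sketch).
  final case class Event(location: String, numVehicles: Long, timestampMs: Long)

  val MinuteMs: Long = 60L * 1000L

  // Start timestamps of every window [start, start + sizeMs) that contains `ts`,
  // with window starts aligned to multiples of `slideMs` (assumed alignment).
  def windowStarts(ts: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    val lastStart = ts - (ts % slideMs)
    (lastStart until (ts - sizeMs) by -slideMs).toList
  }

  // Sum numVehicles per (location, window start), mirroring keyBy + timeWindow + sum.
  def slidingSums(events: Seq[Event], sizeMs: Long, slideMs: Long): Map[(String, Long), Long] =
    events
      .flatMap(e => windowStarts(e.timestampMs, sizeMs, slideMs).map(start => ((e.location, start), e.numVehicles)))
      .groupBy(_._1)
      .map { case (key, hits) => key -> hits.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("A", 10L, 1 * MinuteMs),
      Event("A", 5L, 7 * MinuteMs),
      Event("B", 3L, 7 * MinuteMs)
    )
    slidingSums(events, 15 * MinuteMs, 5 * MinuteMs).toSeq.sortBy(_._1).foreach {
      case ((location, start), total) =>
        println(s"$location window starting at minute ${start / MinuteMs}: $total vehicles")
    }
  }
}
```

With a 15-minute size and a 5-minute slide, every event falls into three windows, which is why the same reading contributes to three sums.
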
  11. The new DataStream API

        case class Event(location: Location, numVehicles: Long)

        val stream: DataStream[Event] = …;

        stream
          .filter { evt => isIntersection(evt.location) }
          .keyBy("location")
          .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
          .trigger(new Threshold(200))
          .sum("numVehicles")
  12. The new DataStream API

        case class Event(location: Location, numVehicles: Long)

        val stream: DataStream[Event] = …;

        stream
          .filter { evt => isIntersection(evt.location) }
          .keyBy("location")
          .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
          .trigger(new Threshold(200))
          .sum("numVehicles")
          .keyBy( evt => evt.location.grid )
          .mapWithState { (evt, state: Option[Model]) => {
            val model = state.getOrElse(new Model())
            (model.classify(evt), Some(model.update(evt)))
          }}
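
The `mapWithState` call on this slide pairs each event with the state stored for its key and returns an output together with the next state. The semantics can be sketched in plain Scala as a fold over a per-key map. This is an illustration only, not Flink's implementation: the helper, the `Reading` type, and the choice that returning `None` clears the key's state are assumptions of the sketch.

```scala
// Plain-Scala sketch of per-key stateful mapping (no Flink dependency).
object MapWithStateSketch {
  // For each event, look up the state of its key, apply `f`, store the new state,
  // and collect the output. In this sketch, returning None removes the key's state.
  def mapWithState[E, K, S, O](events: Seq[E], keyOf: E => K)(
      f: (E, Option[S]) => (O, Option[S])): Seq[O] =
    events.foldLeft((Map.empty[K, S], Vector.empty[O])) {
      case ((states, out), evt) =>
        val key = keyOf(evt)
        val (result, newState) = f(evt, states.get(key))
        val updated = newState match {
          case Some(s) => states.updated(key, s)
          case None    => states - key
        }
        (updated, out :+ result)
    }._2

  // Hypothetical event type: a vehicle count reported for a grid cell.
  final case class Reading(gridCell: String, numVehicles: Long)

  // Running total of vehicles per grid cell, emitted after every event.
  def runningTotals(readings: Seq[Reading]): Seq[(String, Long)] =
    mapWithState[Reading, String, Long, (String, Long)](readings, _.gridCell) { (r, total) =>
      val updated = total.getOrElse(0L) + r.numVehicles
      ((r.gridCell, updated), Some(updated))
    }

  def main(args: Array[String]): Unit =
    runningTotals(Seq(Reading("g1", 4L), Reading("g2", 1L), Reading("g1", 6L))).foreach(println)
}
```

The state is an `Option` because the first event for a key finds nothing stored, which is the case the slide handles by falling back to a fresh `Model`.
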
  13. IoT / Mobile Applications

    • Events occur on devices
    • Events stored in a log (Queue / Log)
    • Events analyzed in a data streaming system (Stream Analysis)
  14. IoT / Mobile Applications

    Out of order!!!

    [Figure labels: First burst of events; Second burst of events]
  15. IoT / Mobile Applications

    Flink supports out-of-order time (event time) windows, arrival time windows (and mixtures), plus low-latency processing.

    [Figure labels: Event time windows; Arrival time windows; Instant event-at-a-time; First burst of events; Second burst of events]
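
The difference between event time windows and arrival time windows can be shown with a small plain-Scala sketch. It is not Flink code: the `Reading` type, the timestamps, and the tumbling-window helper are hypothetical, chosen so that the second burst of events reaches the system before most of the first burst.

```scala
// Plain-Scala sketch: the same events windowed by event time and by arrival time.
object EventTimeSketch {
  // Hypothetical event carrying both the time it happened and the time it arrived.
  final case class Reading(id: String, eventTimeMs: Long, arrivalTimeMs: Long)

  // Count events per tumbling window of `sizeMs`, keyed by window start,
  // using whichever timestamp `timeOf` selects (non-negative timestamps assumed).
  def tumblingCounts(readings: Seq[Reading], sizeMs: Long)(timeOf: Reading => Long): Map[Long, Int] =
    readings
      .groupBy(r => timeOf(r) / sizeMs * sizeMs)
      .map { case (start, rs) => start -> rs.size }

  // First burst happened at 1-3 s, second burst at 11-12 s, but the second
  // burst arrived first and one event of the first burst arrived very late.
  val readings: Seq[Reading] = Seq(
    Reading("b2-1", 11000L, 20000L),
    Reading("b2-2", 12000L, 20500L),
    Reading("b1-1", 1000L, 21000L),
    Reading("b1-2", 2000L, 21500L),
    Reading("b1-3", 3000L, 35000L)
  )

  def main(args: Array[String]): Unit = {
    println("by event time:   " + tumblingCounts(readings, 10000L)(_.eventTimeMs).toSeq.sorted)
    println("by arrival time: " + tumblingCounts(readings, 10000L)(_.arrivalTimeMs).toSeq.sorted)
  }
}
```

Grouping by event time recovers the two bursts (three and two events), while grouping by arrival time mixes them, which is the problem the out-of-order slide points at.
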
  16. High Availability and Consistency

    • No single point of failure any more
    • Exactly-once processing semantics across the pipeline
    • Checkpoints/fault tolerance is decoupled from windows
      • Allows for highly flexible window implementations

    [Figure labels: ZooKeeper ensemble; Multiple Masters; failover]
  17. Batch and Streaming

    Batch Word Count in the DataStream API

        case class WordCount(word: String, count: Int)

        val text: DataStream[String] = …;

        text
          .flatMap { line => line.split(" ") }
          .map { word => new WordCount(word, 1) }
          .keyBy("word")
          .window(GlobalWindows.create())
          .trigger(new EOFTrigger())
          .sum("count")
  18. Batch and Streaming

    Batch Word Count in the DataSet API

        case class WordCount(word: String, count: Int)

        val text: DataStream[String] = …;

        text
          .flatMap { line => line.split(" ") }
          .map { word => new WordCount(word, 1) }
          .keyBy("word")
          .window(GlobalWindows.create())
          .trigger(new EOFTrigger())
          .sum("count")

        val text: DataSet[String] = …;

        text
          .flatMap { line => line.split(" ") }
          .map { word => new WordCount(word, 1) }
          .groupBy("word")
          .sum("count")
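
The two word counts above express the same computation: a global window that fires only at end of input behaves like a batch `groupBy` followed by `sum`. The plain-Scala sketch below illustrates that equivalence under simplifying assumptions. It uses no Flink types, and the function names are hypothetical.

```scala
// Plain-Scala sketch: batch word count vs. element-at-a-time counting that
// emits its result only once the input is exhausted.
object BatchOnStreamSketch {
  // DataSet-style: see the whole collection, group it, then count each group.
  def batchWordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // DataStream-style: update one running count per key as each word passes by.
  // Returning only the final map plays the role of the end-of-input trigger.
  def streamingWordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to stream")
    println(batchWordCount(lines).toSeq.sorted)
    println(streamingWordCount(lines).toSeq.sorted)
  }
}
```

Both functions return the same map for any finite input, which is the sense in which batch is a special case of streaming here.
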
  19. Batch and Streaming

    [Figure labels: DataSet; DataStream; Relational Optimizer; Window Optimization; Pipelined and blocking operators; Pipelined and windowed operators; Streaming Dataflow Runtime; Batch Parameters; Streaming Parameters; Schedule lazily; Schedule eagerly; Recompute whole operators; Periodic checkpoints; Streaming data movement; Stateful operations; DAG recovery; Fully buffered streams; DAG resource management]
  20. Batch and Streaming

    A full-fledged batch processor as well

    See the talk at Flink Forward 2015 by Dongwon Kim: "A comparative performance evaluation of Flink"
  21. Monitoring

    • Live system metrics and user-defined accumulators/statistics
    • Monitoring REST API for custom monitoring tools

        GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators

        {
          "id": "dceafe2df1f57a1206fcb907cb38ad97",
          "user-accumulators": [
            { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
            { "name": "genwords", "type": "LongCounter", "value": "75000000" }
          ]
        }
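
A custom monitoring tool would read the `user-accumulators` entries out of a response like the one on this slide. The sketch below does that with a regular expression from the Scala standard library, only to stay dependency-free: it assumes the exact field order shown on the slide, the type and function names are hypothetical, and a real tool would more likely use a proper JSON parser.

```scala
// Plain-Scala sketch: extract accumulator entries from a response body shaped
// like the one on the slide. Regex-based and deliberately simplistic.
object AccumulatorSketch {
  final case class Accumulator(name: String, kind: String, value: String)

  // Matches one { "name": ..., "type": ..., "value": ... } object, in that field order.
  private val Entry =
    """\{\s*"name"\s*:\s*"([^"]*)"\s*,\s*"type"\s*:\s*"([^"]*)"\s*,\s*"value"\s*:\s*"([^"]*)"\s*\}""".r

  def parse(body: String): List[Accumulator] =
    Entry.findAllMatchIn(body).map(m => Accumulator(m.group(1), m.group(2), m.group(3))).toList

  // Response body copied from the slide.
  val SampleBody: String =
    """{ "id": "dceafe2df1f57a1206fcb907cb38ad97", "user-accumulators": [ { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" }, { "name": "genwords", "type": "LongCounter", "value": "75000000" } ] }"""

  def main(args: Array[String]): Unit =
    parse(SampleBody).foreach(a => println(s"${a.name} (${a.kind}) = ${a.value}"))
}
```

Values are kept as strings because the response encodes them that way. Converting them (for example with `toLong` for a `LongCounter`) is left to the caller.
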
  22. Towards Flink 1.0

    • Flink 1.0 is around the corner
    • Focus on defining public APIs and automatic API compatibility checks
    • Guarantee backwards compatibility in all Flink 1.X versions
  23. Beyond Flink 1.0

    • Flink engine has most features in place
    • Focus on usability features on top of the DataStream API
      • e.g., SQL, ML, more connectors
    • Continue work on elasticity and memory management
  24. tl;dr

    • Streaming is happening
    • Better adapt now
    • Flink 0.10: a modern, ready-to-use open source stream processor
  25. Read more

      • flink.apache.org/blog
      • data-artisans.com/blog

    Subscribe to the mailing lists

    Follow @ApacheFlink

    Get involved at a local meetup