Stream Processing - Speaker Deck

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Objectives
1. Stream processing use cases
2. Open source architectures
3. Product comparison

Christine Doig. Víctor Herrero. June 2014

Slide 3

Slide 3 text

Index 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 4

Slide 4 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 5

Slide 5 text

1. Motivation – BI Architecture
• Quite long process

Slide 6

Slide 6 text

1. Motivation – BI Architecture
• Quite long process
• Constantly changing

Slide 7

Slide 7 text

1. Motivation – BI Architecture
• Performance is crucial
• Quite long process
• Constantly changing

Slide 8

Slide 8 text

1. Motivation – BI Architecture
• The DW is updated periodically
Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf

Slide 9

Slide 9 text

1. Motivation – BI Architecture
• Your DW is out-of-date
Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf

Slide 10

Slide 10 text

1. Motivation – BI Architecture
• Your DW is out-of-date
• Your DSS is out-of-date
Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf

Slide 11

Slide 11 text

1. Motivation – BI Architecture
• Your DW is out-of-date
• Your DSS is out-of-date
• No “real-time” accuracy
Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf

Slide 12

Slide 12 text

1. Motivation – Stream processing

Slide 13

Slide 13 text

1. Motivation – Stream processing
Diagram labels: Logs, Tweets, …

Slide 14

Slide 14 text

1. Motivation – Stream processing
Diagram labels: Logs, Tweets, …; Ingestion

Slide 15

Slide 15 text

1. Motivation – Stream processing
Diagram labels: Logs, Tweets, …; Ingestion; Real-time Analytics

Slide 16

Slide 16 text

1. Motivation – Stream processing
Diagram labels: Logs, Tweets, …; Ingestion; Real-time Analytics

Slide 17

Slide 17 text

1. Motivation – Stream processing
Diagram labels: Logs, Tweets, …; Ingestion; Real-time Analytics; Business strategy

Slide 18

Slide 18 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 19

Slide 19 text

2. Lambda Architecture
Diagram labels: New data; Batch layer (All data); Serving layer (Batch views); Speed layer (Real-time views); Queries
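The diagram can be read as a small program. The following is a minimal sketch, assuming a page-view counting use case that is not in the slides; the function names `batch_view`, `realtime_view` and `query` and the sample events are hypothetical. It shows one common reading of the architecture: the batch layer precomputes a view from the data it has already absorbed, the speed layer covers only the newest data, and a query merges both.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer: recompute a view from the data absorbed so far."""
    return Counter(event["page"] for event in all_data)


def realtime_view(new_data):
    """Speed layer: view over the data the batch layer has not seen yet."""
    return Counter(event["page"] for event in new_data)


def query(page, batch, realtime):
    """Query: merge the batch view and the real-time view."""
    return batch[page] + realtime[page]


# Hypothetical sample events (assumption, not from the slides).
all_data = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
new_data = [{"page": "/home"}]

batch = batch_view(all_data)
speed = realtime_view(new_data)
page_views_home = query("/home", batch, speed)
print(page_views_home)
```

In this sketch the real-time view can be thrown away once the next batch run has absorbed `new_data`, which is what keeps the speed layer small.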

Slide 20

Slide 20 text

2. Lambda Architecture
Diagram labels: Logs, Tweets, …; Ingestion; Real-time Analytics; Speed layer; Batch layer; Serving layer; Views (analytics performed on views)

Slide 21

Slide 21 text

2. Lambda Architecture
Diagram labels: Logs, Tweets, …; Speed layer; Batch layer; Serving layer; Views (analytics performed on views)

Slide 22

Slide 22 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 23

Slide 23 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 24

Slide 24 text

3. Products - Flume
• Flume uses streaming data flows to collect large amounts of data efficiently
Diagram labels: Data input; Data collection; Data output (storage)

Slide 25

Slide 25 text

3. Products - Flume

Slide 26

Slide 26 text

3. Products - Flume
• An event is a unit of data

Slide 27

Slide 27 text

3. Products - Flume
• An event is a unit of data
Diagram labels: event

Slide 28

Slide 28 text

3. Products - Flume
• An event is a unit of data
• Events flow through one or more agents
Diagram labels: event

Slide 29

Slide 29 text

3. Products - Flume
• An event is a unit of data
• Events flow through one or more agents
Diagram labels: event; agent; agent; storage

Slide 30

Slide 30 text

3. Products - Flume
• An event is a unit of data
• Events flow through one or more agents
• An agent is a process composed of:
– Sources
– Channels
– Sinks
Diagram labels: event; agent; agent; storage

Slide 31

Slide 31 text

3. Products - Flume
• An event is a unit of data
• Events flow through one or more agents
• An agent is a process composed of:
– Sources
– Channels
– Sinks
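The three parts of an agent can be sketched in a few lines. This is an illustrative model only, not Flume's API: the classes `MemoryChannel`, `Source` and `Sink`, the dictionary used as an event, and the sample log lines are all hypothetical. It shows a source turning input into events, a channel buffering them, and a sink draining the channel into storage.

```python
from collections import deque


class MemoryChannel:
    """Channel: buffers events between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


class Source:
    """Source: wraps incoming data into events and hands them to a channel."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, body):
        self.channel.put({"body": body})


class Sink:
    """Sink: takes events from a channel and writes them to storage."""

    def __init__(self, channel, storage):
        self.channel = channel
        self.storage = storage

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.storage.append(event["body"])
            event = self.channel.take()


storage = []  # stands in for the storage at the end of the flow
channel = MemoryChannel()
source = Source(channel)
sink = Sink(channel, storage)

for line in ["GET /index.html", "GET /about.html"]:
    source.receive(line)

buffered_before_drain = len(channel._events)
sink.drain()
print(buffered_before_drain, storage)
```

The channel is what decouples the two ends: the source can keep accepting events while the sink is slow or temporarily unavailable.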

Slide 32

Slide 32 text

3. Products - Flume

Slide 33

Slide 33 text

3. Products - Flume

Slide 34

Slide 34 text

3. Products - Flume
Diagram labels: Telnet; Agent; Console; HDFS

Slide 35

Slide 35 text

3. Products - Flume
Diagram labels: Telnet; Agent (Source, Channel, Sink, Sink); Console; HDFS

Slide 36

Slide 36 text

3. Products - Flume
Diagram labels: Telnet; Agent (Source, Channel, Channel, Sink, Sink); Console; HDFS

Slide 37

Slide 37 text

3. Products - Flume
Diagram labels: Telnet; Agent; Console; HDFS

Slide 38

Slide 38 text

3. Products - Flume
myagent.sources = src
myagent.channels = chan1 chan2
myagent.sinks = sink1 sink2
Diagram labels: Telnet; myagent (src, chan1, chan2, sink1, sink2); Console; HDFS

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

3. Products - Flume
myagent.sources = src
myagent.channels = chan1 chan2
myagent.sinks = sink1 sink2

myagent.sources.src.type = netcat
myagent.sources.src.bind = localhost
myagent.sources.src.port = 44444

myagent.sinks.sink1.type = logger

myagent.sinks.sink2.type = hdfs
myagent.sinks.sink2.hdfs.path = …

myagent.channels.chan1.type = memory
myagent.channels.chan2.type = memory

Diagram labels: Telnet; myagent (src, chan1, chan2, sink1, sink2); Console; HDFS

Slide 42

Slide 42 text

3. Products - Flume
myagent.sources = src
myagent.channels = chan1 chan2
myagent.sinks = sink1 sink2

myagent.sources.src.type = netcat
myagent.sources.src.bind = localhost
myagent.sources.src.port = 44444

myagent.sinks.sink1.type = logger

myagent.sinks.sink2.type = hdfs
myagent.sinks.sink2.hdfs.path = …

myagent.channels.chan1.type = memory
myagent.channels.chan2.type = memory

myagent.sources.src.channels = chan1 chan2
myagent.sinks.sink1.channel = chan1
myagent.sinks.sink2.channel = chan2

Diagram labels: Telnet; myagent (src, chan1, chan2, sink1, sink2); Console; HDFS
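The last three properties are what wire the components together. As a rough illustration (this is not how Flume itself loads its configuration), the sketch below parses only the wiring lines from the slide and lists the resulting routes. The helpers `parse_properties` and `wiring` are hypothetical, and the type settings, including the elided `hdfs.path`, are left out on purpose.

```python
# Wiring lines copied from the slide; component type settings are omitted.
CONFIG = """
myagent.sources = src
myagent.channels = chan1 chan2
myagent.sinks = sink1 sink2
myagent.sources.src.channels = chan1 chan2
myagent.sinks.sink1.channel = chan1
myagent.sinks.sink2.channel = chan2
"""


def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def wiring(props, agent):
    """Return (source, channel, sink) routes for one agent."""
    routes = []
    for source in props[f"{agent}.sources"].split():
        source_channels = props[f"{agent}.sources.{source}.channels"].split()
        for sink in props[f"{agent}.sinks"].split():
            sink_channel = props[f"{agent}.sinks.{sink}.channel"]
            if sink_channel in source_channels:
                routes.append((source, sink_channel, sink))
    return routes


routes = wiring(parse_properties(CONFIG), "myagent")
for route in routes:
    print(" -> ".join(route))
```

Note the asymmetry in the property names: a source lists several `channels`, while each sink reads from exactly one `channel`.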

Slide 43

Slide 43 text

3. Products - Flume

Slide 44

Slide 44 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 45

Slide 45 text

3. Products - Storm
Storm is a distributed, fault-tolerant real-time computation system.

Use cases:
- Stream processing
- Real-time analytics
- Online machine learning
- Distributed RPC
- Continuous computation

A Storm topology consumes streams of data and processes those streams in complex ways, repartitioning the streams between each stage of the computation however needed.

Characteristics:
- Free and open source
- Scalable: routing and partitioning of streams
- Fault-tolerant: monitors and reassigns failed tasks
- Guarantees your data will be processed: tracking tuple trees

Source: http://storm.incubator.apache.org/

Slide 46

Slide 46 text

3. Products - Storm. Architecture
A Storm cluster has two kinds of nodes:
• Master node: Nimbus. Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
• Worker nodes: Supervisors. Each supervisor listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it.

All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster.

Nimbus and Supervisors are:
- fail-fast: the process self-destructs whenever any unexpected situation is encountered
- stateless: all state is kept in Zookeeper or on local disk

Source: http://storm.incubator.apache.org/documentation/Tutorial.html

Slide 47

Slide 47 text

3. Products - Storm. Key Concepts (I)
• Topology: graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between them.
• Stream: an unbounded sequence of tuples. A tuple is a named list of values, and a field in a tuple can be an object of any type.

The basic primitives Storm provides for doing stream transformations are “spouts” and “bolts”:
• Spout: a source of streams.
• Bolt: consumes any number of input streams, does some processing, and possibly emits new streams.

Source: http://storm.incubator.apache.org/documentation/Tutorial.html
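To make the vocabulary concrete, here is a minimal sketch that models a topology with plain Python generators. It does not use Storm's API: `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, tuples are ordinary Python tuples, and the stream is finite only so that the example terminates.

```python
from collections import Counter


def sentence_spout():
    """Spout: source of a stream (finite here, unbounded in a real topology)."""
    for sentence in ["the cow jumped", "the moon"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
emitted = list(topology)
print(emitted)
```

Chaining the generators plays the role of the links between nodes: each stage only sees the tuples emitted by the stage before it.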

Slide 48

Slide 48 text

3. Products - Storm. Key Concepts (II)
• Stream groupings: define how to send tuples from one set of tasks to another set of tasks.
Source: https://github.com/nathanmarz/storm/wiki/Concepts

• Nodes (machines): machines configured to participate in a Storm cluster.
• Workers (JVMs): independent JVM processes running on a node.
• Executors (threads): Java threads running within a worker JVM process.
• Tasks (bolt/spout instances): instances of spouts and bolts.
Source: Storm Blueprints: Patterns for Distributed Real-time Computation.
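A stream grouping is essentially a function from a tuple to a task index. The sketch below illustrates two groupings that Storm offers, shuffle and fields grouping, under simplifying assumptions: the helper names and the CRC32-based hash are hypothetical and are not Storm's actual implementation.

```python
import random
import zlib


def fields_grouping(value, num_tasks):
    """Route by field value: equal values always reach the same task."""
    return zlib.crc32(str(value).encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks, rng):
    """Route at random, spreading tuples evenly across tasks on average."""
    return rng.randrange(num_tasks)


NUM_TASKS = 4  # hypothetical number of bolt instances
words = ["storm", "spark", "storm", "flume", "storm"]

assignments = [(word, fields_grouping(word, NUM_TASKS)) for word in words]
storm_tasks = {task for word, task in assignments if word == "storm"}

rng = random.Random(0)
shuffled = [shuffle_grouping(NUM_TASKS, rng) for _ in words]
print(assignments, shuffled)
```

Fields grouping is what makes a per-word counter correct when the counting bolt runs as several tasks: every occurrence of a word lands on the task that holds its count.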

Slide 49

Slide 49 text

3. Products - Storm. Example
Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Slide 50

Slide 50 text

3. Products – Storm. Guaranteeing message processing
Storm’s basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system. Using Trident, a higher-level abstraction over Storm’s basic abstractions, you can achieve exactly-once processing.
Source: http://storm.incubator.apache.org/documentation/Tutorial.html

Each bolt in the tree can either acknowledge (ack) or fail a tuple.
-> If all bolts in the tree acknowledge tuples derived from the trunk tuple, the spout's ack method will be called to indicate that message processing is complete.
-> If any of the bolts in the tree explicitly fails a tuple, or if processing of the tuple tree exceeds the time-out period, the spout's fail method will be called.
Source: Storm Blueprints: Patterns for Distributed Real-time Computation.
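The ack/fail bookkeeping can be sketched as follows. The XOR trick mirrors the idea Storm's documentation describes for tracking tuple trees, but the `Acker` class, its method names and the message ids are hypothetical, and time-outs are not modelled.

```python
import random


class Acker:
    """Tracks one tuple tree per root message using an XOR of tuple ids."""

    def __init__(self):
        self.pending = {}    # root id -> XOR of tuple ids emitted but not acked
        self.completed = []  # roots for which the spout's ack would be called
        self.failed = []     # roots for which the spout's fail would be called

    def anchor(self, root_id, tuple_id):
        """Register a tuple that belongs to the tree of root_id."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, tuple_id):
        """A bolt acks a tuple; a value of zero means the whole tree is done."""
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            self.completed.append(root_id)

    def fail(self, root_id):
        """A bolt explicitly fails a tuple; the whole tree counts as failed."""
        self.pending.pop(root_id, None)
        self.failed.append(root_id)


rng = random.Random(42)


def new_id():
    return rng.getrandbits(64)


acker = Acker()

# Spout emits the trunk tuple of message "msg-1".
trunk = new_id()
acker.anchor("msg-1", trunk)

# A bolt emits two child tuples anchored to the trunk, then acks the trunk.
child_a, child_b = new_id(), new_id()
acker.anchor("msg-1", child_a)
acker.anchor("msg-1", child_b)
acker.ack("msg-1", trunk)
completed_before_leaves = list(acker.completed)

# Downstream bolts ack the children; only now is the tree complete.
acker.ack("msg-1", child_a)
acker.ack("msg-1", child_b)

# A second message is failed explicitly by a bolt.
acker.anchor("msg-2", new_id())
acker.fail("msg-2")
print(acker.completed, acker.failed)
```

The point of the XOR is constant memory per tree: each id is folded in once when emitted and once when acked, so the value returns to zero exactly when every tuple has been acknowledged.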

Slide 51

Slide 51 text

3. Products – Storm. Fault tolerance
• Worker dies: the supervisor will restart it. If it keeps failing on startup, Nimbus will reassign the worker to another machine.
• Node dies: Nimbus will reassign those tasks to other machines.
• Nimbus or Supervisor dies: they restart. State is kept in Zookeeper.

What happens while Nimbus is down? Is Nimbus a single point of failure?
If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (for example, if you lose a worker machine).

Source: http://storm.incubator.apache.org/documentation/Tutorial.html

Slide 52

Slide 52 text

Stream Processing 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 53

Slide 53 text

3. Products - Spark Streaming
Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams.
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Slide 54

Slide 54 text

3. Products - Spark Streaming. Concepts
• RDD (Resilient Distributed Dataset): distributed memory abstraction to perform in-memory computations on large clusters in a fault-tolerant manner, by logging the transformations used to build a dataset (its lineage) rather than the actual data.
• DStream: sequence of RDDs representing a stream of data. Input DStreams can come from Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets…
• Transformations: modify data from one DStream to another.
  • Standard RDD operations: map, count by value, reduce, join
  • Stateful operations: window, countByValueAndWindow
• Output operations: send data to an external entity

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
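The DStream idea can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: `discretize`, `count_by_value` and `count_by_value_and_window` are hypothetical helpers, lists stand in for RDDs, and the timestamps are made up. It shows a stream being cut into batches, a per-batch operation, and a windowed operation that spans several batches.

```python
from collections import Counter


def discretize(records, batch_seconds):
    """Cut (timestamp, value) records into consecutive batches."""
    batches = {}
    for timestamp, value in records:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not batches:
        return []
    return [batches.get(index, []) for index in range(max(batches) + 1)]


def count_by_value(batch):
    """Per-batch operation: count occurrences within one batch."""
    return Counter(batch)


def count_by_value_and_window(batches, window_length):
    """Windowed operation: count over the last window_length batches."""
    results = []
    for end in range(1, len(batches) + 1):
        total = Counter()
        for batch in batches[max(0, end - window_length):end]:
            total.update(batch)
        results.append(total)
    return results


# Hypothetical records as (seconds since start, value).
records = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (2.9, "c")]

batches = discretize(records, batch_seconds=1)
per_batch = [count_by_value(batch) for batch in batches]
windowed = count_by_value_and_window(batches, window_length=2)
print(batches, per_batch, windowed)
```

Each element of `windowed` is computed only from existing batches, which is why a lost result can be rebuilt from its inputs instead of being checkpointed record by record.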

Slide 55

Slide 55 text

3. Products - Spark Streaming. Example
Source: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf

Slide 56

Slide 56 text

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Slide 62

Slide 62 text

3. Products – Storm vs Spark Streaming

Stateful Stream Processing (e.g. Storm)
Event-driven, record-at-a-time processing model:
- Each node has mutable state
- For each record, update state & send new records
Storm:
- Replays a record if it was not processed by a node
- Processes each record at least once
- May update mutable state twice! -> Trident

Discretized Stream Processing (e.g. Spark Streaming)
Streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
Spark:
- Treats each batch of data as RDDs and processes them using RDD operations
- Compute RDD from lineage

Source: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
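The "may update mutable state twice" point is easy to reproduce. The sketch below simulates at-least-once delivery with one replayed record and compares a naive running total against one that remembers which record ids it has already applied. The deduplication by record id is a simplification chosen for illustration and is not Trident's actual mechanism; `process_at_least_once` and the sample records are hypothetical.

```python
def process_at_least_once(records, replayed_ids):
    """Apply records to two running totals under at-least-once delivery.

    records: list of (record_id, value)
    replayed_ids: ids whose first acknowledgement was lost, so they are
        delivered a second time.
    Returns (naive_total, deduplicated_total).
    """
    deliveries = []
    for record_id, value in records:
        deliveries.append((record_id, value))
        if record_id in replayed_ids:
            deliveries.append((record_id, value))  # replay of the same record

    naive_total = 0
    deduplicated_total = 0
    seen_ids = set()
    for record_id, value in deliveries:
        naive_total += value  # mutable state updated on every delivery
        if record_id not in seen_ids:
            seen_ids.add(record_id)
            deduplicated_total += value  # state updated once per record id
    return naive_total, deduplicated_total


naive_total, deduplicated_total = process_at_least_once(
    [(1, 10), (2, 5), (3, 1)], replayed_ids={2}
)
print(naive_total, deduplicated_total)
```

In the discretized model the same problem is sidestepped differently: a lost batch result is recomputed deterministically from its input, so there is no partially updated state to correct.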

Slide 63

Slide 63 text

Index 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 64

Slide 64 text

4. Conclusion. Product comparison
Products: Flume | Storm | Spark Streaming
Main use case: Collecting, aggregating and moving large amounts of log data | Real-time analytics. Online machine learning. | Stream processing of live data. Complex algorithms. Online machine learning
Concepts: Source, Channel, Sink | Topologies, Spouts and bolts | D-Streams, RDDs
Model: Move data to process. | Message passing model | Move process to data
Fault-tolerance: Replicated flow | State in Zookeeper. Replicated database | Input replication + lineage
Language: Java | Java/Clojure | Scala

Slide 65

Slide 65 text

Index 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 66

Slide 66 text

5. Demo

Slide 67

Slide 67 text

Index 1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References

Slide 68

Slide 68 text

6. References

Basic references
Online references:
• Lecture slides: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
• Lambda architecture: http://lambda-architecture.net/
• Book (not fully released yet): Big Data – Principles and best practices of scalable real-time data systems: http://www.manning.com/marz/BDmeapch1.pdf

Flume
• Online documentation: http://flume.apache.org/
• Real time Data Ingest into Hadoop using Flume: http://events.linuxfoundation.org/sites/events/files/slides/RealTimeDataIngestUsingFlume.pdf

Storm
• Online documentation: http://storm.incubator.apache.org/
• Source code: https://github.com/nathanmarz/storm
• Book: Storm Real-time Processing Cookbook by Quinton Anderson
• Tutorial: http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/

Slide 69

Slide 69 text

6. References

Spark Streaming
• Strata Conference 2013 presentation: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
• Online docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html
• Online tutorial: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
• Paper: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf

Slide 70

Slide 70 text

Stream Processing
Q&A
Thank you for your attention!