Stream Processing

Data Warehousing.
Master in Innovation and Research in Informatics, UPC, Barcelona, 2014.

Christine Doig

June 11, 2014

Transcript

  1. Objectives. Christine Doig. Víctor Herrero. June 2014

    1. Stream processing use cases 2. Open Source architectures 3. Products comparison
  2. Index

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  3. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  4. 1. Motivation – BI Architecture

    Performance is crucial. Quite long process. Constantly changing.
  5. 1. Motivation – BI Architecture

    The data warehouse (DW) is updated periodically. Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
  6. 1. Motivation – BI Architecture

    Your DW is out-of-date. Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
  7. 1. Motivation – BI Architecture

    Your DW is out-of-date. Your decision support system (DSS) is out-of-date. Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
  8. 1. Motivation – BI Architecture

    Your DW is out-of-date. Your DSS is out-of-date. No “real-time” accuracy. Source: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
  9. 1. Motivation – Stream processing

    Diagram labels: Logs, Tweets, …, Ingestion, Real-time Analytics.
  10. 1. Motivation – Stream processing

    Diagram labels: Logs, Tweets, …, Ingestion, Real-time Analytics.
  11. 1. Motivation – Stream processing

    Diagram labels: Logs, Tweets, …, Ingestion, Real-time Analytics, Business strategy.
  12. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  13. 2. Lambda Architecture

    Diagram labels: New data, All data, Batch layer, Serving layer (Batch views), Speed layer (Real-time views), Query.
  14. 2. Lambda Architecture

    Diagram labels: Logs, Tweets, …, Ingestion, Real-time Analytics, Speed layer, Batch layer, Serving layer (Views). Analytics are performed on the views.
  15. 2. Lambda Architecture

    Diagram labels: Logs, Tweets, …, Speed layer, Batch layer, Serving layer (Views). Analytics are performed on the views.
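The three layers named on these slides can be illustrated with a minimal in-memory sketch. It is not taken from the deck: the class `LambdaCounts` and its methods are hypothetical names, the workload (counting events per key) is an assumption, and a real deployment would use distributed storage and processing rather than Python lists and counters.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda Architecture that counts events per key (illustrative only)."""

    def __init__(self):
        self.master = []                # batch layer: append-only "all data"
        self.batch_view = Counter()     # serving layer: view built by the last batch run
        self.realtime_view = Counter()  # speed layer: data seen since the last batch run

    def ingest(self, key):
        """New data is sent to both the batch layer and the speed layer."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from all data; the real-time view is then superseded."""
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        """A query merges the batch view with the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


lc = LambdaCounts()
for event in ["log", "tweet", "log"]:
    lc.ingest(event)
lc.run_batch()      # the batch view now covers the first three events
lc.ingest("log")    # only the speed layer has seen this one so far
```

Here `lc.query("log")` combines two events from the batch view with one from the real-time view, which is the sense in which queries are answered from views rather than from the raw data.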
  16. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  17. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  18. 3. Products - Flume

    • Flume uses streaming data flows to collect large amounts of data efficiently. Diagram labels: Data input, Data collection, Data output (storage).
  19. 3. Products - Flume

    • An event is a unit of data
  20. 3. Products - Flume

    • An event is a unit of data. Diagram labels: event.
  21. 3. Products - Flume

    • An event is a unit of data • Events flow through one or more agents. Diagram labels: event.
  22. 3. Products - Flume

    • An event is a unit of data • Events flow through one or more agents. Diagram labels: event, agent, agent, storage.
  23. 3. Products - Flume

    • An event is a unit of data • Events flow through one or more agents • An agent is a process composed of: – Sources – Channels – Sinks. Diagram labels: event, agent, agent, storage.
  24. 3. Products - Flume

    • An event is a unit of data • Events flow through one or more agents • An agent is a process composed of: – Sources – Channels – Sinks
  25. 3. Products - Flume

    Diagram labels: Telnet, Agent (Source, Channel, Sink, Sink), Console, HDFS.
  26. 3. Products - Flume

    Diagram labels: Telnet, Agent (Source, Channel, Channel, Sink, Sink), Console, HDFS.
  27. 3. Products - Flume

    myagent.sources = src
    myagent.channels = chan1 chan2
    myagent.sinks = sink1 sink2

    Diagram labels: Telnet, myagent (src, chan1, chan2, sink1, sink2), Console, HDFS.
  28. 3. Products - Flume

    myagent.sources = src
    myagent.channels = chan1 chan2
    myagent.sinks = sink1 sink2

    myagent.sources.src.type = netcat
    myagent.sources.src.bind = localhost
    myagent.sources.src.port = 44444

    Diagram labels: Telnet, myagent (src, chan1, chan2, sink1, sink2), Console, HDFS.
  29. 3. Products - Flume

    myagent.sources = src
    myagent.channels = chan1 chan2
    myagent.sinks = sink1 sink2

    myagent.sources.src.type = netcat
    myagent.sources.src.bind = localhost
    myagent.sources.src.port = 44444

    myagent.sinks.sink1.type = logger

    myagent.sinks.sink2.type = hdfs
    myagent.sinks.sink2.hdfs.path = …

    Diagram labels: Telnet, myagent (src, chan1, chan2, sink1, sink2), Console, HDFS.
  30. 3. Products - Flume

    myagent.sources = src
    myagent.channels = chan1 chan2
    myagent.sinks = sink1 sink2

    myagent.sources.src.type = netcat
    myagent.sources.src.bind = localhost
    myagent.sources.src.port = 44444

    myagent.sinks.sink1.type = logger

    myagent.sinks.sink2.type = hdfs
    myagent.sinks.sink2.hdfs.path = …

    myagent.channels.chan1.type = memory
    myagent.channels.chan2.type = memory

    Diagram labels: Telnet, myagent (src, chan1, chan2, sink1, sink2), Console, HDFS.
  31. 3. Products - Flume

    myagent.sources = src
    myagent.channels = chan1 chan2
    myagent.sinks = sink1 sink2

    myagent.sources.src.type = netcat
    myagent.sources.src.bind = localhost
    myagent.sources.src.port = 44444

    myagent.sinks.sink1.type = logger

    myagent.sinks.sink2.type = hdfs
    myagent.sinks.sink2.hdfs.path = …

    myagent.channels.chan1.type = memory
    myagent.channels.chan2.type = memory

    myagent.sources.src.channels = chan1 chan2
    myagent.sinks.sink1.channel = chan1
    myagent.sinks.sink2.channel = chan2

    Diagram labels: Telnet, myagent (src, chan1, chan2, sink1, sink2), Console, HDFS.
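The wiring in this configuration (one source feeding two channels, each drained by its own sink) can be mimicked in a few lines of plain Python. This is a rough sketch and not Flume code: the `Agent` class and its methods are hypothetical, the lists `console` and `hdfs` merely stand in for the logger and HDFS sinks, and copying each event into every channel is an assumption about how the source fans out.

```python
from collections import deque


class Agent:
    """Toy agent: one source, named in-memory channels, one sink per channel."""

    def __init__(self, channel_names):
        self.channels = {name: deque() for name in channel_names}
        self.sinks = []  # (channel name, delivery function) pairs

    def bind_sink(self, channel_name, deliver):
        """Attach a sink to exactly one channel, like `sinks.<name>.channel`."""
        self.sinks.append((channel_name, deliver))

    def source_receive(self, event):
        """The source copies each incoming event into every channel it is bound to."""
        for queue in self.channels.values():
            queue.append(event)

    def drain(self):
        """Each sink takes events from its own channel, in arrival order."""
        for channel_name, deliver in self.sinks:
            queue = self.channels[channel_name]
            while queue:
                deliver(queue.popleft())


console, hdfs = [], []                # stand-ins for the logger and HDFS sinks
agent = Agent(["chan1", "chan2"])
agent.bind_sink("chan1", console.append)
agent.bind_sink("chan2", hdfs.append)
for line in ["hello", "world"]:       # lines typed into the telnet session
    agent.source_receive(line)
agent.drain()
```

After `drain()`, both stand-in sinks hold the same two events, because each channel buffered its own copy independently of the other.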
  32. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  33. 3. Products - Storm

    Storm is a distributed, fault-tolerant, real-time computation system.

    Use cases:
    - Stream processing
    - Real-time analytics
    - Online machine learning
    - Distributed RPC
    - Continuous computation

    A Storm topology consumes streams of data and processes those streams in complex ways, repartitioning the streams between each stage of the computation however needed.

    Characteristics:
    - Free and Open Source
    - Scalable: routing and partitioning of streams
    - Fault-tolerant: monitors and reassigns failed tasks
    - Guarantees your data will be processed: tracking tuple trees

    Source: http://storm.incubator.apache.org/
  34. 3. Products - Storm. Architecture

    A Storm cluster has two kinds of nodes:
    • Master node: Nimbus. Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
    • Worker nodes: Supervisors. Each supervisor listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it.

    All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster.

    Nimbus and Supervisors are:
    - fail-fast: the process self-destructs whenever any unexpected situation is encountered
    - stateless: all state is kept in ZooKeeper or on local disk

    Source: http://storm.incubator.apache.org/documentation/Tutorial.html
  35. 3. Products - Storm. Key Concepts (I)

    • Topology: a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed between them.
    • Stream: an unbounded sequence of tuples. A tuple is a named list of values, and a field in a tuple can be an object of any type.

    The basic primitives Storm provides for doing stream transformations are “spouts” and “bolts”:
    • Spout: a source of streams.
    • Bolt: consumes any number of input streams, does some processing, and possibly emits new streams.

    Source: http://storm.incubator.apache.org/documentation/Tutorial.html
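The spout and bolt roles can be pictured with Python generators chained into a tiny word-count pipeline. This is only an analogy, not the Storm API: the function names are hypothetical, the "stream" is a short finite list rather than an unbounded one, and everything runs in one process instead of on a cluster.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (a short list stands in for an unbounded stream)."""
    for sentence in ["the cow jumped", "the moon"]:
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits the updated count."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# The "topology" is the graph spout -> split bolt -> count bolt.
topology_output = list(count_bolt(split_bolt(sentence_spout())))
```

Each tuple is a named set of values (a dict here), and the links between stages are simply the order in which the generators are nested.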
  36. 3. Products - Storm. Key Concepts (II)

    • Stream groupings: define how to send tuples from one set of tasks to another set of tasks. Source: https://github.com/nathanmarz/storm/wiki/Concepts
    • Nodes (machines): these are simply machines configured to participate in a Storm cluster.
    • Workers (JVMs): these are independent JVM processes running on a node.
    • Executors (threads): these are Java threads running within a worker JVM process.
    • Tasks (bolt/spout instances): tasks are instances of spouts and bolts.

    Source: Storm Blueprints: Patterns for Distributed Real-time Computation.
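Two common groupings, shuffle and fields, can be sketched as routing functions that decide which task receives each tuple. The code below is an illustration under stated assumptions, not Storm's implementation: the function names are hypothetical, the shuffle is a seeded random choice, and the fields grouping uses a CRC32 hash modulo the number of tasks.

```python
import random
import zlib


def shuffle_grouping(tuples, num_tasks, seed=0):
    """Spread tuples over tasks at random (seeded so the sketch is repeatable)."""
    rng = random.Random(seed)
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        tasks[rng.randrange(num_tasks)].append(tup)
    return tasks


def fields_grouping(tuples, field, num_tasks):
    """Route tuples with equal values of `field` to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        key = str(tup[field]).encode("utf-8")
        tasks[zlib.crc32(key) % num_tasks].append(tup)
    return tasks


words = [{"word": w} for w in ["storm", "spark", "storm", "flume", "storm"]]
shuffled = shuffle_grouping(words, num_tasks=3)
by_field = fields_grouping(words, "word", num_tasks=3)
```

With the fields grouping, every `"storm"` tuple lands on one task, which is what a per-word counting bolt needs; the shuffle grouping makes no such promise and only balances load.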
  37. 3. Products - Storm. Example

    Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  38. 3. Products – Storm. Guaranteeing message processing

    Storm’s basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system. Using Trident, a higher-level abstraction over Storm’s basic abstractions, you can achieve exactly-once processing. Source: http://storm.incubator.apache.org/documentation/Tutorial.html

    Each bolt in the tree can either acknowledge (ack) or fail a tuple.
    -> If all bolts in the tree acknowledge tuples derived from the trunk tuple, the spout's ack method will be called to indicate that message processing is complete.
    -> If any of the bolts in the tree explicitly fail a tuple, or if processing of the tuple tree exceeds the time-out period, the spout's fail method will be called.

    Source: Storm Blueprints: Patterns for Distributed Real-time Computation.
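The ack/fail rules above can be expressed as a small bookkeeping class. This is a simplified sketch: `TupleTree` is a hypothetical name, the set of pending tuple ids is an assumption made for readability (Storm's own tracking is implemented differently), and time is passed in explicitly so the example is deterministic.

```python
class TupleTree:
    """Tracks the tuples derived from one spout ("trunk") tuple."""

    def __init__(self, root_id, timeout_s, now):
        self.pending = {root_id}          # tuples emitted but not yet acked
        self.deadline = now + timeout_s
        self.outcome = None               # becomes "ack" or "fail" exactly once

    def emit(self, parent_id, child_id):
        """A bolt emits child_id anchored to a parent that is still pending."""
        if self.outcome is None and parent_id in self.pending:
            self.pending.add(child_id)

    def ack(self, tuple_id, now):
        """A bolt acknowledges a tuple; the tree completes when none are pending."""
        if self.outcome is not None:
            return
        if now > self.deadline:
            self.outcome = "fail"         # time-out: the spout's fail method would run
            return
        self.pending.discard(tuple_id)
        if not self.pending:
            self.outcome = "ack"          # here the spout's ack method would run

    def fail(self, tuple_id):
        """Any explicit failure fails the whole tree."""
        if self.outcome is None:
            self.outcome = "fail"         # here the spout's fail method would run


tree = TupleTree(root_id="t1", timeout_s=30, now=0)
tree.emit("t1", "t2")                     # the bolt emits children before acking its input
tree.emit("t1", "t3")
tree.ack("t1", now=1)
tree.ack("t2", now=2)
still_open = tree.outcome                 # "t3" is pending, so nothing is decided yet
tree.ack("t3", now=3)
```

A spout that receives a fail would typically re-emit the trunk tuple, which is where the at-least-once behaviour comes from.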
  39. 3. Products – Storm. Fault tolerance

    • Worker dies: the supervisor will restart it. If it keeps failing on startup, Nimbus will reassign the worker to another machine.
    • Node dies: Nimbus will reassign those tasks to other machines.
    • Nimbus or Supervisor dies: they restart. State is kept in ZooKeeper.

    What happens while Nimbus is down? Is Nimbus a single point of failure? If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (for example, if you lose a worker machine).

    Source: http://storm.incubator.apache.org/documentation/Tutorial.html
  40. Stream Processing

    1. Motivation 2. Technical development 3. Products • Flume • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  41. 3. Products - Spark Streaming

    Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
  42. 3. Products - Spark Streaming. Concepts

    • RDD (Resilient Distributed Dataset): a distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner, by logging the transformations used to build a dataset (its lineage) rather than the actual data.
    • DStream: a sequence of RDDs representing a stream of data. Input DStreams come from Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets…
    • Transformations: modify data from one DStream to another.
      • Standard RDD operations: map, countByValue, reduce, join
      • Stateful operations: window, countByValueAndWindow
    • Output operations: send data to an external entity.

    Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
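To make the window idea concrete, a DStream can be modelled as a plain list of batches, with a windowed count computed over the most recent batches. The sketch below does not use Spark: `count_by_value_and_window` is a hypothetical helper that imitates the idea behind `countByValueAndWindow`, and window length and slide are measured in batches rather than seconds as a simplifying assumption.

```python
from collections import Counter


def count_by_value_and_window(batches, window_len, slide):
    """Count values over a sliding window of batches.

    batches:    list of lists, one inner list per batch interval
    window_len: number of most recent batches each window covers
    slide:      number of batches between consecutive windows
    """
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        window = batches[start:end]
        results.append(Counter(value for batch in window for value in batch))
    return results


# Four batch intervals of a toy stream.
batches = [["a", "b"], ["a"], ["c", "a"], ["b"]]
windows = count_by_value_and_window(batches, window_len=2, slide=1)
```

Each result covers at most two batches, so the second window counts `"a"` twice (once from each of the first two batches) while the last window no longer sees the first batch at all.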
  43. 3. Products – Storm vs Spark Streaming

    Stateful Stream Processing (e.g. Storm)
    Event-driven, record-at-a-time processing model:
    - Each node has mutable state
    - For each record, update state & send new records
    Storm:
    - Replays record if not processed by a node
    - Processes each record at least once
    - May update mutable state twice! -> Trident

    Discretized Stream Processing (e.g. Spark Streaming)
    Streaming computation as a series of very small, deterministic batch jobs:
    - Chop up the live stream into batches of X seconds
    Spark:
    - Compute RDD from lineage
    - Treats each batch of data as RDDs and processes them using RDD operations

    Source: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
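The remark that at-least-once processing "may update mutable state twice" can be reproduced with a replayed record. The sketch below is an illustration only: the function names and the `(record_id, value)` layout are hypothetical, and deduplicating by record id is a stand-in technique chosen for brevity, not a description of how Trident achieves exactly-once semantics.

```python
def sum_at_least_once(deliveries):
    """Naive stateful update: every delivery changes the state, replays included."""
    total = 0
    for _record_id, value in deliveries:
        total += value
    return total


def sum_with_dedup(deliveries):
    """Idempotent update: a record id that was already applied is skipped."""
    seen = set()
    total = 0
    for record_id, value in deliveries:
        if record_id in seen:
            continue
        seen.add(record_id)
        total += value
    return total


# Record 2 is delivered twice, as happens when a tuple is replayed after a failure.
deliveries = [(1, 10), (2, 5), (2, 5)]
naive_total = sum_at_least_once(deliveries)
dedup_total = sum_with_dedup(deliveries)
```

The naive version counts the replayed record twice, while the deduplicating version applies it once; the price is the extra state needed to remember which records have been seen.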
  44. Index 1. Motivation 2. Technical development 3. Products • Flume

    • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  45. 4. Conclusion. Product comparison

    Products: Flume, Storm, Spark Streaming.
    Main use case: Flume: collecting, aggregating and moving large amounts of log data. Storm: real-time analytics, online machine learning. Spark Streaming: stream processing of live data, complex algorithms, online machine learning.
    Concepts: Flume: Source, Channel, Sink. Storm: Topologies, Spouts and Bolts. Spark Streaming: D-Streams, RDDs.
    Model: Move data to process. Message passing model. Move process to data.
    Fault-tolerance: Flume: replicated flow. Storm: state in ZooKeeper, replicated database. Spark Streaming: input replication + lineage.
    Language: Flume: Java. Storm: Java/Clojure. Spark Streaming: Scala.
  46. Index 1. Motivation 2. Technical development 3. Products • Flume

    • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  47. Index 1. Motivation 2. Technical development 3. Products • Flume

    • Storm • Spark Streaming 4. Conclusions 5. Demo 6. References
  48. 6. References

    Basic references
    • Lecture slides: https://learnsql.fib.upc.es/moodle/file.php/19/Slides/00-Introduction.pdf
    • Lambda architecture: http://lambda-architecture.net/
    • Book (not fully released yet): Big Data – Principles and best practices of scalable real-time data systems: http://www.manning.com/marz/BDmeapch1.pdf

    Flume
    • Online documentation: http://flume.apache.org/
    • Real time Data Ingest into Hadoop using Flume: http://events.linuxfoundation.org/sites/events/files/slides/RealTimeDataIngestUsingFlume.pdf

    Storm
    • Online documentation: http://storm.incubator.apache.org/
    • Source code: https://github.com/nathanmarz/storm
    • Book: Storm Real-time Processing Cookbook by Quinton Anderson
    • Tutorial: http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/
  49. 6. References

    Spark Streaming
    • Strata Conference 2013 presentation: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/large-scale-near-real-time-stream-processing-tathagata-das-strata-2013.pdf
    • Online docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html
    • Online tutorial: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
    • Paper: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  50. Stream Processing

    Q&A. Thank you for your attention!