Storm Introduction

An introduction to realtime bigdata using Storm

datacrunchers

May 24, 2012

Transcript

  1. BigData Processing

    MapReduce
    - Run-once apps
    - Batch
    New needs
    - More flexibility
    - Incremental processing
    Realtime does not replace batch!
    - Lambda Architecture
    Thursday 24 May 12
  2. Realtime Processing

    But
    - We are talking BigData
    - We want it to be
      • a solution, not a set of components
      • stable
      • scalable
      • recognizable
  3. Storm

    Created at Twitter (BackType)
    - Analyzing the Twitter graph
    Provides
    - Scalability
    - Reliability
    - Flexibility
    Written in Java & Clojure
  4. Storm - Scalable

    Scalable by design
    Add more machines if needed
    Example:
    - 1M msg/s on a 10-node cluster
    - including hundreds of DB calls per second
  5. Storm - Reliable

    Guarantees no data loss
    - every message will be processed
    Fault-tolerant
    - Reassigns tasks if necessary
    Transactional
    - using batches
  6. Storm - Flexible

    Lots of use cases
    - Stream processing
    - Continuous computation
    - Distributed Remote Procedure Calls
    Just works
    - Great scripts
    - Storm-deploy project for EC2
  7. Design - Nimbus

    Manages the cluster
    - You submit a Jar to Nimbus
    - Nimbus distributes the code around the cluster
    Use the `storm` client to communicate
    - only for remote clusters
    - deploy new topologies
    - kill topologies
    - ...
  8. Design - Zookeeper

    Used for coordination
    NOT used for message passing
    Single-node quorum sufficient for most cases
    Watch out!
    - Fails fast
      • Use monitoring software
    - Keeps growing
      • Cron job to compact data and logs
  9. Design - Worker Node

    [Diagram labels: Nimbus, Zookeeper, Supervisor, Worker, Worker, Worker Node]
  10. Design - Worker

    Physical Java VM
    Executes tasks
    Tasks are spread evenly across workers
    Every worker uses a port
    - Starts at 6700
    - configurable
    Multiple workers per machine
    - defaults to 4
    - configurable
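
The worker layout on this slide can be modelled in a few lines of plain Python. This is only a sketch under stated assumptions: the port base (6700) and the default of 4 workers per machine come from the slide, while the function names and the round-robin placement are illustrative choices and not a description of Storm's actual scheduler.

```python
from collections import defaultdict

BASE_PORT = 6700          # first worker port, as stated on the slide
WORKERS_PER_MACHINE = 4   # default worker count per machine, as stated on the slide


def worker_ports(base_port=BASE_PORT, count=WORKERS_PER_MACHINE):
    """Return one port per worker slot on a machine."""
    return [base_port + i for i in range(count)]


def spread_tasks(task_ids, ports):
    """Assign tasks to worker ports round-robin (an assumed placement rule)."""
    assignment = defaultdict(list)
    for index, task_id in enumerate(task_ids):
        assignment[ports[index % len(ports)]].append(task_id)
    return dict(assignment)


ports = worker_ports()
assignment = spread_tasks(range(1, 11), ports)
for port in ports:
    print(port, assignment[port])
```

With ten tasks and four workers, no worker ends up with more than one task above any other, which is the "spread evenly" property the slide describes.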
  11. Design - Worker

    [Diagram labels: Nimbus, Zookeeper, Supervisor, Worker, Task, Task, ..., Task, Task, Worker Node]
  12. Design - Task

    One thread within a worker JVM
    Executes a spout or bolt
    Several tasks for one spout/bolt
    - configured when defining the topology
  13. Concepts - Tuple

    Named list of values
    Dynamically typed
    Needs to know how to serialize each value type
    - Extendable with custom serializers
    - Java serialization by default (slow!)
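
A tuple as described here, a named list of dynamically typed values with pluggable serializers, might be sketched as follows. The class `SketchTuple`, the `SERIALIZERS` registry and the use of JSON as the wire format are all hypothetical stand-ins chosen for illustration; Storm's own tuple class and serialization mechanism are different.

```python
import json
from datetime import date


class SketchTuple:
    """A named list of values: field names plus positional values."""

    def __init__(self, fields, values):
        if len(fields) != len(values):
            raise ValueError("each field needs exactly one value")
        self.fields = list(fields)
        self.values = list(values)

    def get(self, field):
        """Look a value up by its field name."""
        return self.values[self.fields.index(field)]


# Hypothetical registry of custom serializers, keyed by value type.
SERIALIZERS = {date: lambda d: d.isoformat()}


def serialize(tup):
    """Encode the values as JSON, using a custom serializer where one is registered."""
    def encode(value):
        serializer = SERIALIZERS.get(type(value))
        return serializer(value) if serializer else value
    return json.dumps([encode(v) for v in tup.values])


t = SketchTuple(["word", "count", "seen"], ["storm", 3, date(2012, 5, 24)])
print(t.get("count"))
print(serialize(t))
```

The values carry mixed types (string, integer, date) without any declaration, and only the type that the default encoder cannot handle needs a registered serializer.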
  14. Concepts - Stream

    Sequence of tuples
    - Identified
    - Defined with a schema
  15. Concepts - Spout

    Source of streams
    Reliable vs unreliable
    - replay-able tuples vs fire-and-forget
    1+ streams per spout
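
The difference between a reliable and an unreliable spout comes down to whether emitted messages are remembered until they are acknowledged. The sketch below is a plain-Python model, not Storm code: the class `ReliableSpoutSketch` is hypothetical, and its method names merely echo the `nextTuple`, `ack` and `fail` callbacks of Storm's spout interface.

```python
from collections import deque


class ReliableSpoutSketch:
    """Keeps every emitted message until it is acked, and replays it on failure."""

    def __init__(self, messages):
        self.queue = deque(messages)
        self.pending = {}   # message id -> message still awaiting an ack
        self.next_id = 0

    def next_tuple(self):
        """Emit the next message with a fresh id, or None when nothing is queued."""
        if not self.queue:
            return None
        message = self.queue.popleft()
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = message
        return msg_id, message

    def ack(self, msg_id):
        """Downstream processing succeeded: forget the message."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Downstream processing failed: put the message back for replay."""
        self.queue.append(self.pending.pop(msg_id))


spout = ReliableSpoutSketch(["a", "b"])
first_id, first = spout.next_tuple()
second_id, second = spout.next_tuple()
spout.ack(first_id)     # "a" was processed
spout.fail(second_id)   # "b" was not, so it is queued again
replay_id, replayed = spout.next_tuple()
print(replayed)
```

A fire-and-forget spout would simply skip the `pending` bookkeeping, which is cheaper but gives up the replay guarantee.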
  16. Concepts - Bolt

    Simple stream transformations
    1+ input streams
    0+ output streams
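
To make "simple stream transformations" concrete, here is a rough Python model of two bolts: one that turns a stream of sentences into a stream of words, and one that consumes its input and emits nothing (zero output streams). The names `split_sentence_bolt` and `CountingSinkBolt` are illustrative assumptions, and plain iterables stand in for Storm streams.

```python
def split_sentence_bolt(stream):
    """Transform a stream of sentences into a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


class CountingSinkBolt:
    """Consumes a stream and emits nothing: a bolt with zero output streams."""

    def __init__(self):
        self.counts = {}

    def execute(self, stream):
        """Count each word seen on the input stream."""
        for word in stream:
            self.counts[word] = self.counts.get(word, 0) + 1


sentences = ["storm is fast", "storm is reliable"]
sink = CountingSinkBolt()
sink.execute(split_sentence_bolt(sentences))
print(sink.counts)
```

Chaining the two mirrors how bolts are wired together in a topology: each one only knows its input stream and, optionally, what it emits.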
  17. Concepts - Grouping

    Partitioning of a stream over bolt tasks
    7 different groupings out of the box
    Write your own
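
Two of Storm's built-in groupings, shuffle grouping and fields grouping, illustrate what partitioning a stream over bolt tasks means. The sketch below is an assumption-laden model in plain Python: the CRC32 hash and the unweighted random choice are stand-ins picked for illustration, not the routing logic Storm itself uses.

```python
import random
import zlib


def fields_grouping(value, num_tasks):
    """Route by a stable hash of the field value: equal values reach the same task."""
    return zlib.crc32(value.encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks, rng=random):
    """Route to a randomly chosen task to spread load (a simplification)."""
    return rng.randrange(num_tasks)


NUM_TASKS = 4
words = ["storm", "bolt", "storm", "spout", "bolt"]
routes = [(word, fields_grouping(word, NUM_TASKS)) for word in words]
print(routes)
```

Fields grouping is what a word-counting bolt needs, since every occurrence of a given word must land on the task that holds its counter; shuffle grouping suits stateless bolts where any task will do.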