
Konark Modi

Apache Storm: Designing NRT(NearRealTime) stream processing system

MunichDataGeeks

June 25, 2015

Transcript

  1. About Me
     Moved to Munich in Jan 2015. Currently working as a Software Engineer at
     Cliqz: Cliqz is a new way to navigate the Internet. With CLIQZ for
     Firefox you click directly, quickly, and safely through the Web. Before
     that, I worked at one of the largest e-commerce websites in India.
  2. Why do we care about a data pipeline?
     We are generating data at a rapid pace. Data sources are abundant and
     come in different formats and at different frequencies. We need a
     proactive approach: gain insights as the data is being generated. The
     behaviour of the product/app needs to adapt to how the user has engaged
     in the past, and is engaging at the moment.
  3. Components
     ## Data ingestion layer:
     Tightly coupled with the stream processing layer. Deployment model. Data
     source reliability. Multiple consumers. Replay of messages (covered
     under guaranteed message processing). Data locality. Example: Kafka (see
     the ingestion sketch below).
     ## Processing layer (common patterns):
     Batch. Micro-batch. Streaming.
     ## Storage / query layer:
     Processed data served from a cache. Intermediate processing data.
     Persist raw and processed data.
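     A minimal sketch of the ingestion side. The slide only names Kafka as an
     example; the kafka-python client, the "wiki-edits" topic name, and the
     broker address below are assumptions for illustration.

     import json

     from kafka import KafkaConsumer

     consumer = KafkaConsumer(
         "wiki-edits",                        # hypothetical topic name
         bootstrap_servers="localhost:9092",  # assumed broker address
         group_id="storm-ingest",             # consumer group: allows multiple consumers
     )

     for message in consumer:
         # Each message is one raw event; decode and parse it.
         event = json.loads(message.value.decode("utf-8"))
         print(event.get("action"), event.get("page_title"))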
  4. Stream processing layer
     A layer that provides not only real-time computation, but also an
     infrastructure for never-ending, continuous data processing.
  5. Stream processing layer
     ## Challenges:
     Scalability: ingest hundreds of millions of events per minute/day in
     real time. High degree of robustness. Reliable data processing. Fault
     tolerance: resilient to software and hardware failures while continuing
     to meet SLAs. Low latency: site-facing applications need response times
     on the order of milliseconds. Partitioning, routing, serialization.
     Meters and gauges for what's going on under the hood. Control knobs.
     Multi-language support.
  6. Stream processing layer
     ## Apache Storm:
     High degree of robustness. Reliable data processing. Fault tolerance.
     Low latency. Partitioning, routing, serialization. Meters and gauges for
     what's going on under the hood. Control knobs. Multi-language support.
  7. Components of a Storm cluster
     ## Conceptual level: Topologies, Streams, Spouts, Bolts.
     ## Physical level: Nimbus, ZooKeeper, Supervisor, Storm UI.
     ## Executing components of a topology: Workers, Tasks, Executors.
  8. Spout: reads Wikipedia edit events from a websocket (streamparse).

     import json

     from streamparse.spout import Spout
     from websocket import create_connection


     class webSocketSpout(Spout):

         def initialize(self, stormconf, context):
             # Open a connection to the websocket feed of edit events.
             self.ws = create_connection("ws://websocket.local.local:9000/")

         def next_tuple(self):
             # Read one raw event, check that it parses as JSON,
             # and emit the raw string.
             result = self.ws.recv()
             json.loads(result)
             self.emit([result])


     if __name__ == '__main__':
         webSocketSpout().run()

     '''
     Sample event:
     {
         "action": "edit",
         "change_size": 328,
         "flags": "M",
         "hashtags": [],
         "is_anon": false,
         "is_bot": false,
         "is_minor": true,
         "is_new": false,
         "is_unpatrolled": false,
         "mentions": [],
         "ns": "Main",
         "page_title": "St. Andre Bessette Catholic Secondary School",
         "parent_rev_id": "663970563",
         "rev_id": "659207915",
         "summary": "replace with infobox school per TfD",
         "url": "http://en.wikipedia.org/w/index.php?diff=663970563&oldid=659207915",
         "user": "Frietjes"
     }
     '''
  9. Bolts: parse each event into counter keys, then update Redis.

     import json
     import time
     from collections import Counter

     from redis import StrictRedis
     from streamparse.bolt import Bolt


     class jsonParser(Bolt):

         def initialize(self, conf, ctx):
             self.counts = Counter()

         def process(self, tup):
             # Turn one edit event into a list of "<dimension>_<value>" keys.
             keys = []
             json_object = json.loads(str(tup.values[0]))
             keys.append(["isanon_" + str(json_object["is_anon"])])
             if json_object.get("is_anon"):
                 keys.append(["anon_anon"])
             else:
                 keys.append(["anon_loggedin"])
             if json_object.get("is_bot"):
                 keys.append(["bot_bot"])
             else:
                 keys.append(["bot_human"])
             keys.append(["action_" + json_object["action"]])
             if json_object.get("geo_ip"):
                 keys.append(["country_" + json_object["geo_ip"]["country_name"]])
             self.emit_many(keys)


     class RedisBolt(Bolt):

         def initialize(self, conf, ctx):
             self.redis = StrictRedis(host="redishost")
             self.counter = Counter()

         def process(self, tup):
             keys, = tup.values
             # Split only on the first "_" so values containing "_" stay intact.
             key, word = keys.split("_", 1)
             if key == 'action':
                 self.redis.zincrby("actions", str(word), 1)
                 self.redis.zadd(key, int(time.time()), word)
             else:
                 self.redis.zincrby(key, str(word), 1)
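     Once the bolts run, the counters can be read straight back from Redis
     (the "storage / query layer" from slide 3). A minimal query sketch,
     assuming the same "redishost" and the key names written by the
     RedisBolt above:

     from redis import StrictRedis

     r = StrictRedis(host="redishost")

     # Top actions by count (sorted set "actions", highest score first).
     for action, count in r.zrevrange("actions", 0, 9, withscores=True):
         print(action.decode(), int(count))

     # Bot vs. human edit counts, as written by RedisBolt.
     print(r.zscore("bot", "bot"), r.zscore("bot", "human"))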
  10. Topology definition (streamparse Clojure DSL) and submission.

     (ns wikipedialogs
       (:use [streamparse.specs])
       (:gen-class))

     (defn wikipedialogs [options]
       [;; spout configuration
        {"websocket-spout" (python-spout-spec
                             options
                             "spouts.websocket.webSocketSpout"
                             ["word"]
                             :p 1)}
        ;; bolt configuration
        {"parser-bolt" (python-bolt-spec
                         options
                         {"websocket-spout" :shuffle}
                         "bolts.bolts.jsonParser"
                         ["word"]
                         :p 2)
         "redis-bolt" (python-bolt-spec
                        options
                        ;; the original slide reads {"count-bolt" :shuffle}, but no
                        ;; "count-bolt" is defined; "parser-bolt" is presumably meant
                        {"parser-bolt" :shuffle}
                        "bolts.bolts.RedisBolt"
                        []  ;; does not emit any fields
                        :p 2)}])

     ## Submit topology
     # sparse run
     # sparse submit --name wikipedialogs
  11. Grouping
     Stream groupings decide how tuples are routed from one component's tasks
     to the next (a small illustration follows below).
     Note: for the detailed list, please refer to
     https://storm.apache.org/documentation/Tutorial.html
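     Plain Python to illustrate the routing semantics, not Storm API: shuffle
     grouping spreads tuples randomly across a bolt's tasks, while fields
     grouping always sends the same field value to the same task.

     import random

     NUM_TASKS = 4

     def shuffle_grouping(tup):
         # Shuffle: pick a task at random -> even load, no locality.
         return random.randrange(NUM_TASKS)

     def fields_grouping(tup, field):
         # Fields: hash the grouping field -> the same value always lands
         # on the same task (needed e.g. for per-key counting).
         return hash(tup[field]) % NUM_TASKS

     event = {"word": "action_edit"}
     print(shuffle_grouping(event))
     print(fields_grouping(event, "word"))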
  12. Nimbus node: manages, monitors, and coordinates topologies running on
     the cluster; handles deployment of topologies, task assignment, and
     re-assignment in case of failure. ZooKeeper nodes: coordinate the Storm
     cluster. Supervisor nodes: communicate with Nimbus through ZooKeeper and
     start and stop workers according to signals from Nimbus. Fault tolerant.
     (A configuration sketch follows below.)
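     How these pieces find each other is configured in storm.yaml on each
     node. A minimal sketch; the hostnames are assumptions, the keys are
     standard Storm configuration of this era:

     # storm.yaml sketch
     storm.zookeeper.servers:
       - "zk1.local"
       - "zk2.local"
     nimbus.host: "nimbus.local"   # pre-1.0 single-Nimbus setting
     supervisor.slots.ports:       # one worker slot per port
       - 6700
       - 6701
       - 6702
       - 6703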
  13. Executing components: Workers, Tasks, Executors.
     Image source: official docs
     (https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html)
  14. Guaranteed message processing
     ### How does the flow work
     Tuple tree: the spout emits a tuple, which goes to a bolt; the bolt
     produces another tuple based on the previous one, and the next bolt
     produces another set. A spout tuple is not considered fully complete
     until all the tuples in its tree have finished processing. If it is not
     completed within a specified amount of time, the spout tuple is
     replayed. We can leverage the reliability API by anchoring, which is
     essentially tagging each new tuple with its input tuple. Dedicated
     special tasks, called acker tasks, track this. (A sketch of a reliable
     spout follows below.)
     ### Scenarios
     A tuple isn't acked because the task died. The acker task dies. The
     spout task dies.
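     A sketch of how the reliability API looks on the spout side in
     streamparse: emitting with a tuple id makes the tuple trackable, and
     ack/fail callbacks fire when its tree completes or fails. The
     pending-dict replay bookkeeping and the read_event() source are
     assumptions for illustration, not from the slides.

     import uuid

     from streamparse.spout import Spout


     class ReliableSpout(Spout):

         def initialize(self, stormconf, context):
             self.pending = {}  # tup_id -> values, kept until acked

         def next_tuple(self):
             values = [self.read_event()]      # hypothetical source read
             tup_id = str(uuid.uuid4())
             self.pending[tup_id] = values
             self.emit(values, tup_id=tup_id)  # id makes the tuple trackable

         def ack(self, tup_id):
             # The whole tuple tree finished: forget the message.
             self.pending.pop(tup_id, None)

         def fail(self, tup_id):
             # The tree failed or timed out: replay the same values.
             self.emit(self.pending[tup_id], tup_id=tup_id)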
  15. Storm UI and CLI
     Basic cluster / topology / spout / bolt level summaries; useful for
     watching performance. Basic controls. Rebalance the cluster in Storm to
     change parallelism (see the command below).
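     The rebalance command from the Storm CLI; the topology and component
     names here are taken from slide 10, and the counts are illustrative.

     # -w: seconds to deactivate before redistributing
     # -n: new number of workers
     # -e: new executor count for a component
     storm rebalance wikipedialogs -w 10 -n 4 -e parser-bolt=4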
  16. Other features and resources
     Trident topologies. DRPC. Resource managers: Storm-YARN, Storm with
     Mesos.
     Running Apache Storm securely:
     https://github.com/apache/storm/blob/master/SECURITY.md
     Storm deployment (Wirbelsturm): https://github.com/miguno/wirbelsturm
     Internal messaging in Apache Storm:
     Intra-worker communication: inter-thread, on the same Storm node.
     Inter-worker communication: node-to-node, across the network.
     Inter-topology or across-cluster communication: nothing built into
     Storm.
     Useful read:
     http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
     streamparse: https://github.com/Parsely/streamparse
     Storm official docs: https://storm.apache.org/
  17. And just like a topology: "This topic is a never-ending discussion;
     catch me around for a demo and more details." Thank you & questions :)