
Konark Modi

Apache Storm: Designing NRT(NearRealTime) stream processing system

MunichDataGeeks

June 25, 2015

Transcript

  1. About Me
     Moved to Munich in Jan 2015. Currently working as a Software Engineer at
     Cliqz: Cliqz is a new way to navigate the Internet. With CLIQZ for
     Firefox you click directly, quickly, and safely through the Web. Before
     that, I worked at one of the largest e-commerce websites in India.
  2. Why do we care about a data pipeline?
     We are generating data at a rapid pace. Data sources are abundant and
     come in different formats and at different frequencies. We need a
     proactive approach: gain insights as the data is being generated. The
     behaviour of the product/app needs to adapt to how the user has engaged
     in the past, and is engaging at the moment.
  3. Components
     ## Data ingestion layer:
     Tightly coupled with the stream processing layer. Deployment model. Data
     source reliability. Multiple consumers. Replay of messages (covered
     under guaranteed message processing). Data locality. Example: Kafka (see
     the ingestion sketch below).
     ## Processing layer (common patterns):
     Batch. Micro-batch. Streaming.
     ## Storage / query layer:
     Processed data served from a cache. Intermediate processing data.
     Persist raw and processed data.
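     A minimal sketch of the ingestion side. The slide only names Kafka as an
     example; the kafka-python client, the "wiki-edits" topic name, and the
     broker address below are assumptions for illustration.

     import json

     from kafka import KafkaConsumer

     consumer = KafkaConsumer(
         "wiki-edits",                        # hypothetical topic name
         bootstrap_servers="localhost:9092",  # assumed broker address
         group_id="storm-ingest",             # consumer group: allows multiple consumers
     )

     for message in consumer:
         # Each message is one raw event; decode and parse it.
         event = json.loads(message.value.decode("utf-8"))
         print(event.get("action"), event.get("page_title"))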
  4. Stream processing layer
     A layer that provides not only real-time computation, but also an
     infrastructure for never-ending, continuous data processing.
  5. Stream processing layer
     ## Challenges:
     Scalability: ingest hundreds of millions of events per minute/day in
     real time. High degree of robustness. Reliable data processing. Fault
     tolerance: resilient to software and hardware failures while continuing
     to meet SLAs. Low latency: site-facing applications need response times
     on the order of milliseconds. Partitioning, routing, serialization.
     Meters and gauges for what's going on under the hood. Control knobs.
     Multi-language support.
  6. Stream processing layer
     ## Apache Storm:
     High degree of robustness. Reliable data processing. Fault tolerance.
     Low latency. Partitioning, routing, serialization. Meters and gauges for
     what's going on under the hood. Control knobs. Multi-language support.
  7. Components of a Storm cluster
     ## Conceptual level: Topologies, Streams, Spouts, Bolts.
     ## Physical level: Nimbus, ZooKeeper, Supervisor, Storm UI.
     ## Executing components of a topology: Workers, Tasks, Executors.
  8. Spout: reads Wikipedia edit events from a websocket (streamparse).

     import json

     from streamparse.spout import Spout
     from websocket import create_connection


     class webSocketSpout(Spout):

         def initialize(self, stormconf, context):
             # Open a connection to the websocket feed of edit events.
             self.ws = create_connection("ws://websocket.local.local:9000/")

         def next_tuple(self):
             # Read one raw event, check that it parses as JSON,
             # and emit the raw string.
             result = self.ws.recv()
             json.loads(result)
             self.emit([result])


     if __name__ == '__main__':
         webSocketSpout().run()

     '''
     Sample event:
     {
         "action": "edit",
         "change_size": 328,
         "flags": "M",
         "hashtags": [],
         "is_anon": false,
         "is_bot": false,
         "is_minor": true,
         "is_new": false,
         "is_unpatrolled": false,
         "mentions": [],
         "ns": "Main",
         "page_title": "St. Andre Bessette Catholic Secondary School",
         "parent_rev_id": "663970563",
         "rev_id": "659207915",
         "summary": "replace with infobox school per TfD",
         "url": "http://en.wikipedia.org/w/index.php?diff=663970563&oldid=659207915",
         "user": "Frietjes"
     }
     '''
  9. Bolts: parse each event into counter keys, then update Redis.

     import json
     import time
     from collections import Counter

     from redis import StrictRedis
     from streamparse.bolt import Bolt


     class jsonParser(Bolt):

         def initialize(self, conf, ctx):
             self.counts = Counter()

         def process(self, tup):
             # Turn one edit event into a list of "<dimension>_<value>" keys.
             keys = []
             json_object = json.loads(str(tup.values[0]))
             keys.append(["isanon_" + str(json_object["is_anon"])])
             if json_object.get("is_anon"):
                 keys.append(["anon_anon"])
             else:
                 keys.append(["anon_loggedin"])
             if json_object.get("is_bot"):
                 keys.append(["bot_bot"])
             else:
                 keys.append(["bot_human"])
             keys.append(["action_" + json_object["action"]])
             if json_object.get("geo_ip"):
                 keys.append(["country_" + json_object["geo_ip"]["country_name"]])
             self.emit_many(keys)


     class RedisBolt(Bolt):

         def initialize(self, conf, ctx):
             self.redis = StrictRedis(host="redishost")
             self.counter = Counter()

         def process(self, tup):
             keys, = tup.values
             # Split only on the first "_" so values containing "_" stay intact.
             key, word = keys.split("_", 1)
             if key == 'action':
                 self.redis.zincrby("actions", str(word), 1)
                 self.redis.zadd(key, int(time.time()), word)
             else:
                 self.redis.zincrby(key, str(word), 1)
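     Once the bolts run, the counters can be read straight back from Redis
     (the "storage / query layer" from slide 3). A minimal query sketch,
     assuming the same "redishost" and the key names written by the
     RedisBolt above:

     from redis import StrictRedis

     r = StrictRedis(host="redishost")

     # Top actions by count (sorted set "actions", highest score first).
     for action, count in r.zrevrange("actions", 0, 9, withscores=True):
         print(action.decode(), int(count))

     # Bot vs. human edit counts, as written by RedisBolt.
     print(r.zscore("bot", "bot"), r.zscore("bot", "human"))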
  10. Topology definition (streamparse Clojure DSL) and submission.

     (ns wikipedialogs
       (:use [streamparse.specs])
       (:gen-class))

     (defn wikipedialogs [options]
       [;; spout configuration
        {"websocket-spout" (python-spout-spec
                             options
                             "spouts.websocket.webSocketSpout"
                             ["word"]
                             :p 1)}
        ;; bolt configuration
        {"parser-bolt" (python-bolt-spec
                         options
                         {"websocket-spout" :shuffle}
                         "bolts.bolts.jsonParser"
                         ["word"]
                         :p 2)
         "redis-bolt" (python-bolt-spec
                        options
                        ;; the original slide reads {"count-bolt" :shuffle}, but no
                        ;; "count-bolt" is defined; "parser-bolt" is presumably meant
                        {"parser-bolt" :shuffle}
                        "bolts.bolts.RedisBolt"
                        []  ;; does not emit any fields
                        :p 2)}])

     ## Submit topology
     # sparse run
     # sparse submit --name wikipedialogs
  11. Grouping
     Stream groupings decide how tuples are routed from one component's tasks
     to the next (a small illustration follows below).
     Note: for the detailed list, please refer to
     https://storm.apache.org/documentation/Tutorial.html
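     Plain Python to illustrate the routing semantics, not Storm API: shuffle
     grouping spreads tuples randomly across a bolt's tasks, while fields
     grouping always sends the same field value to the same task.

     import random

     NUM_TASKS = 4

     def shuffle_grouping(tup):
         # Shuffle: pick a task at random -> even load, no locality.
         return random.randrange(NUM_TASKS)

     def fields_grouping(tup, field):
         # Fields: hash the grouping field -> the same value always lands
         # on the same task (needed e.g. for per-key counting).
         return hash(tup[field]) % NUM_TASKS

     event = {"word": "action_edit"}
     print(shuffle_grouping(event))
     print(fields_grouping(event, "word"))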
  12. Nimbus node: manages, monitors, and coordinates topologies running on
     the cluster; handles deployment of topologies, task assignment, and
     re-assignment in case of failure. ZooKeeper nodes: coordinate the Storm
     cluster. Supervisor nodes: communicate with Nimbus through ZooKeeper and
     start and stop workers according to signals from Nimbus. Fault tolerant.
     (A configuration sketch follows below.)
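     How these pieces find each other is configured in storm.yaml on each
     node. A minimal sketch; the hostnames are assumptions, the keys are
     standard Storm configuration of this era:

     # storm.yaml sketch
     storm.zookeeper.servers:
       - "zk1.local"
       - "zk2.local"
     nimbus.host: "nimbus.local"   # pre-1.0 single-Nimbus setting
     supervisor.slots.ports:       # one worker slot per port
       - 6700
       - 6701
       - 6702
       - 6703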
  13. Executing components: Workers, Tasks, Executors.
     Image source: official docs
     (https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html)
  14. Guaranteed message processing
     ### How does the flow work
     Tuple tree: the spout emits a tuple, which goes to a bolt; the bolt
     produces another tuple based on the previous one, and the next bolt
     produces another set. A spout tuple is not considered fully complete
     until all the tuples in its tree have finished processing. If it is not
     completed within a specified amount of time, the spout tuple is
     replayed. We can leverage the reliability API by anchoring, which is
     essentially tagging each new tuple with its input tuple. Dedicated
     special tasks, called acker tasks, track this. (A sketch of a reliable
     spout follows below.)
     ### Scenarios
     A tuple isn't acked because the task died. The acker task dies. The
     spout task dies.
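     A sketch of how the reliability API looks on the spout side in
     streamparse: emitting with a tuple id makes the tuple trackable, and
     ack/fail callbacks fire when its tree completes or fails. The
     pending-dict replay bookkeeping and the read_event() source are
     assumptions for illustration, not from the slides.

     import uuid

     from streamparse.spout import Spout


     class ReliableSpout(Spout):

         def initialize(self, stormconf, context):
             self.pending = {}  # tup_id -> values, kept until acked

         def next_tuple(self):
             values = [self.read_event()]      # hypothetical source read
             tup_id = str(uuid.uuid4())
             self.pending[tup_id] = values
             self.emit(values, tup_id=tup_id)  # id makes the tuple trackable

         def ack(self, tup_id):
             # The whole tuple tree finished: forget the message.
             self.pending.pop(tup_id, None)

         def fail(self, tup_id):
             # The tree failed or timed out: replay the same values.
             self.emit(self.pending[tup_id], tup_id=tup_id)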
  15. Storm UI and CLI
     Basic cluster / topology / spout / bolt level summaries; useful for
     watching performance. Basic controls. Rebalance the cluster in Storm to
     change parallelism (see the command below).
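     The rebalance command from the Storm CLI; the topology and component
     names here are taken from slide 10, and the counts are illustrative.

     # -w: seconds to deactivate before redistributing
     # -n: new number of workers
     # -e: new executor count for a component
     storm rebalance wikipedialogs -w 10 -n 4 -e parser-bolt=4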
  16. Other features and resources
     Trident topologies. DRPC. Resource managers: Storm-YARN, Storm with
     Mesos.
     Running Apache Storm securely:
     https://github.com/apache/storm/blob/master/SECURITY.md
     Storm deployment (Wirbelsturm): https://github.com/miguno/wirbelsturm
     Internal messaging in Apache Storm:
     Intra-worker communication: inter-thread, on the same Storm node.
     Inter-worker communication: node-to-node, across the network.
     Inter-topology or across-cluster communication: nothing built into
     Storm.
     Useful read:
     http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
     streamparse: https://github.com/Parsely/streamparse
     Storm official docs: https://storm.apache.org/
  17. And just like a topology: "This topic is a never-ending discussion;
     catch me around for a demo and more details." Thank you & questions :)