Slide 1

Slide 1 text

streamparse Defeat the Python GIL with Apache Storm. Andrew Montalenti, CTO 1 of 75

Slide 2

Slide 2 text

About Me CTO/co-founder of Hacking in Python for over a decade Fully distributed team @amontalenti on Twitter:

Slide 3

Slide 3 text

Python GIL Python's GIL does not allow true multi-thread parallelism: And on multi-core, it even leads to lock contention: @dabeaz discussed this in a Friday talk on concurrency.
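A minimal sketch of the effect described on this slide, assuming a standard CPython build with the GIL enabled. It times the same CPU-bound loop run twice in sequence and then in two threads; the helper names (count_down, timed, run_sequential, run_threaded) are illustrative only. On such a build the threaded run is typically no faster, and may be slower, but exact timings depend on the machine and interpreter version.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; the running thread holds the GIL."""
    while n > 0:
        n -= 1


def timed(fn):
    """Return the wall-clock seconds that fn() takes."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_sequential(n):
    """Do the work twice, one call after the other."""
    count_down(n)
    count_down(n)


def run_threaded(n):
    """Do the same work in two threads that compete for the GIL."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    n = 2_000_000
    print("sequential:  %.2fs" % timed(lambda: run_sequential(n)))
    print("two threads: %.2fs" % timed(lambda: run_threaded(n)))
```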

Slide 4

Slide 4 text

Queues and workers

Standard way to solve GIL woes.

    Queues: ZeroMQ => Redis => RabbitMQ
    Workers: Cron Jobs => RQ => Celery

Slide 5

Slide 5 text

Architecture, 2012

Slide 6

Slide 6 text

It started to get messy

Slide 7

Slide 7 text

As Hettinger Says... "There must be a better way..."

Slide 8

Slide 8 text

What is this Storm thing? We read: "Storm is a distributed real-time computation system." Dramatically simplifies your workers and queues. "Great," we thought. "But, what about Python support?" That's what streamparse is about.

Slide 9

Slide 9 text

Our Storm Use Case

Slide 10

Slide 10 text

What is Web content analytics for digital storytellers. Some of our customers:

Slide 11

Slide 11 text

Elegant data dashboards Informing thousands of editors and writers every day:

Slide 12

Slide 12 text

Powerful data APIs Powering billions of site visits every month:

Slide 13

Slide 13 text

Too many datas!

Slide 14

Slide 14 text

"Python Can't Do This" "Free lunch is over." "It can't scale." "It's a toy language." "Shoulda used Scala."

Slide 15

Slide 15 text

Python Can't Scale? Eat that, haters!

Slide 16

Slide 16 text

Thanks to Storm

Slide 17

Slide 17 text

streamparse is Pythonic Storm streamparse lets you parse real-time streams of data. It smoothly integrates Python code with Apache Storm. Easy quickstart, good CLI/tooling, production tested. Good for: Analytics, Logs, Sensors, Low-Latency Stuff.

Slide 18

Slide 18 text

Agenda

    Storm topology concepts
    Storm internals
    How does Python work with Storm?
    streamparse overview
    pykafka preview

Slides on Twitter; follow @amontalenti.

Slides: Notes:

Slide 19

Slide 19 text

Storm Topology Concepts

Slide 20

Slide 20 text

Storm Abstractions

Storm provides abstractions for data processing:

    Tuple
    Spout
    Bolt
    Topology

Slide 21

Slide 21 text

Wired Topology

Slide 22

Slide 22 text

WARNING Using Python pseudocode and coroutines!

Slide 23

Slide 23 text

Tuple

A single data record that flows through your cluster.

    # tuple spec: ["word"]
    word = ("dog",)

    # tuple spec: ["word", "count"]
    word_count = ("dog", 4)

Slide 24

Slide 24 text

Spout

A component that emits raw data into the cluster.

    import time

    class Spout(object):
        def next_tuple(self):
            """Called repeatedly to emit tuples."""

    @coroutine
    def spout_coroutine(spout, target):
        """Get tuple from spout and send it to target."""
        while True:
            tup = spout.next_tuple()
            if tup is None:
                # nothing to emit yet; back off before polling again
                time.sleep(10)
                continue
            if target is not None:
                target.send(tup)

Slide 25

Slide 25 text

Bolt

A component that implements one processing stage.

    import time

    class Bolt(object):
        def process(self, tup):
            """Called repeatedly to process tuples."""

    @coroutine
    def bolt_coroutine(bolt, target):
        """Get tuple from input, process it in Bolt.
        Then send it to next bolt target, if it exists."""
        while True:
            tup = (yield)
            if tup is None:
                # no input yet; back off before waiting again
                time.sleep(10)
                continue
            to_emit = bolt.process(tup)
            if target is not None:
                target.send(to_emit)

Slide 26

Slide 26 text

Topology

Directed Acyclic Graph (DAG) describing it all.

    # lay out topology
    spout = WordSpout
    bolts = [WordCountBolt, DebugPrintBolt]

    # wire topology
    topology = wire(spout=spout, bolts=bolts)

    # start the topology
    next(topology)
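The spout, bolt, and topology pseudocode can be turned into something that actually runs. The following is only a sketch under stated assumptions, not how Storm or streamparse works internally: the coroutine decorator, CollectBolt (a stand-in for DebugPrintBolt), and run_spout are hypothetical names, the spout is driven by a plain loop rather than a coroutine, and the pipeline stops when the word list is exhausted instead of sleeping and polling again.

```python
from collections import Counter
from functools import wraps


def coroutine(func):
    """Prime a generator so it is ready to receive send() calls."""
    @wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


class WordSpout:
    def __init__(self, words):
        self.words = iter(words)

    def next_tuple(self):
        """Return the next one-word tuple, or None when exhausted."""
        word = next(self.words, None)
        return None if word is None else (word,)


class WordCountBolt:
    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        word = tup[0]
        self.counts[word] += 1
        return (word, self.counts[word])


class CollectBolt:
    """Hypothetical stand-in for DebugPrintBolt: records what it sees."""
    def __init__(self):
        self.seen = []

    def process(self, tup):
        self.seen.append(tup)
        return tup


@coroutine
def bolt_coroutine(bolt, target=None):
    """Receive a tuple, process it, and forward the result if possible."""
    while True:
        tup = (yield)
        to_emit = bolt.process(tup)
        if target is not None:
            target.send(to_emit)


def wire(bolts):
    """Chain bolts back to front; return the coroutine of the first bolt."""
    target = None
    for bolt in reversed(bolts):
        target = bolt_coroutine(bolt, target=target)
    return target


def run_spout(spout, target):
    """Drain the spout, pushing each tuple into the first bolt."""
    while True:
        tup = spout.next_tuple()
        if tup is None:
            break
        target.send(tup)


sink = CollectBolt()
head = wire([WordCountBolt(), sink])
run_spout(WordSpout(["dog", "cat", "dog"]), head)
print(sink.seen)  # [('dog', 1), ('cat', 1), ('dog', 2)]
```

Everything here runs in one process and one thread, so it only illustrates the data flow. The point of Storm is that each stage can instead run as many parallel tasks across machines.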

Slide 27

Slide 27 text

Storm Internals

Slide 28

Slide 28 text

Tuple Tree

Slide 29

Slide 29 text

Streams, Grouping and Parallelism

    X            word-spout        word-count-bolt
    input        None              word-spout
    output       word-count-bolt   None
    tuple        ("dog",)          ("dog", 4)
    stream       ["word"]          ["word", "count"]
    grouping     ["word"]          ":shuffle"
    parallelism  2                 8
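To make the grouping row concrete, here is a rough sketch of how a fields grouping and a shuffle grouping might pick a destination task. The function names are hypothetical, and the CRC32 hash is an assumption chosen for illustration; Storm uses its own hashing internally.

```python
import random
import zlib


def fields_grouping(tup, field_index, num_tasks):
    """Pick a task from a stable hash of one field, so tuples with
    equal field values always land on the same task."""
    key = str(tup[field_index]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def shuffle_grouping(num_tasks, rng=random):
    """Pick any task at random, spreading load evenly on average."""
    return rng.randrange(num_tasks)


# With parallelism 8 on the bolt, every ("dog",) tuple goes to one task,
# which is what lets a word-count bolt keep its counts in local memory.
num_tasks = 8
for word in ["dog", "cat", "dog"]:
    print(word, "-> task", fields_grouping((word,), 0, num_tasks))
```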

Slide 30

Slide 30 text

Nimbus and Storm UI

Slide 31

Slide 31 text

Workers and Zookeeper

Slide 32

Slide 32 text

Empty Slots

Slide 33

Slide 33 text

Filled Slots and Rebalancing

Slide 34

Slide 34 text

BTW, Buy This Book! Source of these diagrams. Storm Applied, by Manning Press. Reviewed in Storm, The Big Reference.

Slide 35

Slide 35 text

Network Transfer

Slide 36

Slide 36 text

So, Storm is Sorta Amazing!

Storm...

    will guarantee processing via tuple trees
    does tuneable parallelism per component
    implements a high availability model
    allocates Python process slots on physical nodes
    helps us rebalance computation across cluster
    handles network messaging automatically

And, it beats the GIL!
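The "guarantee processing via tuple trees" point can be illustrated with the XOR bookkeeping that Storm's documentation describes for its acker tasks: every tuple id is XORed into a running value once when the tuple is created and once when it is acked, so the value returns to zero exactly when the whole tree has been acked. The class and function names below are hypothetical; this is a simplified single-tree sketch, not Storm's implementation.

```python
import random


class AckTracker:
    """Track one tuple tree with a single XOR checksum."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        """A tuple joined the tree: XOR its id in."""
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        """A tuple was fully processed: XOR its id out again."""
        self.ack_val ^= tuple_id

    def is_complete(self):
        return self.ack_val == 0


def new_id(rng=random):
    """Random non-zero 64-bit id."""
    return rng.randrange(1, 2 ** 64)


tracker = AckTracker()
root, child = new_id(), new_id()
tracker.emitted(root)    # spout emits the root tuple
tracker.emitted(child)   # a bolt emits a child anchored to the root
tracker.acked(root)      # the bolt acks the root
print(tracker.is_complete())  # False: the child is still pending
tracker.acked(child)
print(tracker.is_complete())  # True: the whole tree is acked
```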

Slide 37

Slide 37 text

Let's Do This!

Slide 38

Slide 38 text

Getting Python on Storm

Slide 39

Slide 39 text

Multi-Lang Protocol (1)

Storm supports Python through the multi-lang protocol.

    JSON protocol
    Works via shell-based components
    Communicate over STDIN and STDOUT

Clean, UNIX-y. Can use CPython, PyPy; no need for Jython or Py4J.

Kinda quirky, but also relatively simple to implement.

Slide 40

Slide 40 text

Multi-Lang Protocol (2)

Each component of a "Python" Storm topology is either:

    ShellSpout
    ShellBolt

Java implementations speak to Python via light JSON.

There's one sub-process per Storm task. If p = 8, then 8 Python processes are spawned.

Slide 41

Slide 41 text

Multi-Lang Protocol (3)

    INIT: JVM    => Python  > JSON
    XFER: JVM    => JVM     > Kryo
    DATA: JVM    => Python  > JSON
    EMIT: Python => JVM     > JSON
    XFER: JVM    => JVM     > Kryo
    ACK:  Python => JVM     > JSON
    BEAT: JVM    => Python  > JSON
    SYNC: Python => JVM     > JSON
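A rough sketch of the Python side of the DATA, EMIT, and ACK steps, assuming the framing in Storm's multi-lang documentation: each message is a JSON document followed by a line containing only "end". The helper names and the upper-casing "bolt" are made up for illustration, and the message fields shown (id, comp, stream, task, tuple, command, anchors) should be checked against the Storm version in use. In-memory streams stand in for STDIN and STDOUT so the example runs on its own.

```python
import io
import json


def read_message(stream):
    """Read lines up to the "end" marker and decode them as one JSON message.
    Returns None if the stream closes before a full message arrives."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None


def send_message(stream, msg):
    """Write one JSON message followed by the "end" marker."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def handle_tuple(msg, out):
    """Toy bolt logic: emit the upper-cased word anchored to the input, then ack."""
    word = msg["tuple"][0]
    send_message(out, {"command": "emit", "anchors": [msg["id"]],
                       "tuple": [word.upper()]})
    send_message(out, {"command": "ack", "id": msg["id"]})


# Stand-ins for the pipes that the JVM-side ShellBolt would own.
incoming = io.StringIO(
    '{"id": "42", "comp": "word-spout", "stream": "default", '
    '"task": 1, "tuple": ["dog"]}\nend\n'
)
outgoing = io.StringIO()
handle_tuple(read_message(incoming), outgoing)
print(outgoing.getvalue())
```

Because the protocol is just framed JSON over pipes, the worker can be any executable, which is why CPython and PyPy both work without a JVM bridge.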

Slide 42

Slide 42 text

issues Storm bundles "" (a multi-lang implementation). But, it's not Pythonic. We'll fix that, we thought!

Slide 43

Slide 43 text

Storm as Infrastructure Thought: Storm should be like Cassandra/Elasticsearch. "Written in Java, but Pythonic nonetheless." Need: Python as a first-class citizen. Must also fix "Javanonic" bits (e.g. packaging).

Slide 44

Slide 44 text

streamparse overview

Slide 45

Slide 45 text

Enter streamparse Initial release Apr 2014; one year of active development. 600+ stars on GitHub, was a trending repo in May 2014. 90+ mailing list members and 5 new committers. 3 engineers maintaining it. Funding from DARPA. (Yes, really!)

Slide 46

Slide 46 text

streamparse CLI

sparse provides a CLI front-end to streamparse, a framework for creating Python projects for running, debugging, and submitting Storm topologies for data processing.

After installing lein (the only dependency), you can run:

    pip install streamparse

This will offer a command-line tool, sparse. Use:

    sparse quickstart

Slide 47

Slide 47 text

Running and debugging

You can then run the local Storm topology using:

    $ sparse run
    Running wordcount topology...
    Options: {:spec "topologies/wordcount.clj", ...}
    #

Slide 48

Slide 48 text

Submitting to remote cluster

Single command:

    $ sparse submit

Does all the following magic:

    Makes virtualenvs across cluster
    Builds a JAR out of your source code
    Opens reverse tunnel to Nimbus
    Constructs an in-memory Topology spec
    Uploads JAR to Nimbus

Slide 49

Slide 49 text

streamparse supplants

Slide 50

Slide 50 text

Let's Make a Topology!

Slide 51

Slide 51 text

Word Stream Spout (Storm DSL)

    {"word-spout" (python-spout-spec
        options
        "spouts.words.WordSpout"    ; class (spout)
        ["word"]                    ; stream (fields)
        )
    }

Slide 52

Slide 52 text

Word Stream Spout in Python

    import itertools

    from streamparse.spout import Spout

    # named WordSpout to match "spouts.words.WordSpout" in the Storm DSL
    class WordSpout(Spout):

        def initialize(self, conf, ctx):
            self.words = itertools.cycle(['dog', 'cat', 'zebra', 'elephant'])

        def next_tuple(self):
            word = next(self.words)
            self.emit([word])

Emits one-word tuples from an endless generator.

Slide 53

Slide 53 text

Word Count Bolt (Storm DSL)

    {"word-count-bolt" (python-bolt-spec
        options
        {"word-spout" ["word"]}        ; input (grouping)
        "bolts.wordcount.WordCount"    ; class (bolt)
        ["word" "count"]               ; stream (fields)
        :p 2                           ; parallelism
        )
    }

Slide 54

Slide 54 text

Word Count Bolt in Python

    from collections import Counter

    from streamparse.bolt import Bolt

    class WordCount(Bolt):

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] += 1
            self.log('%s: %d' % (word, self.counts[word]))

Keeps word counts in-memory (assumes grouping).

Slide 55

Slide 55 text

BatchingBolt for Performance

    from streamparse.bolt import BatchingBolt

    class WordCount(BatchingBolt):

        secs_between_batches = 5

        def group_key(self, tup):
            # collect batches of words
            word = tup.values[0]
            return word

        def process_batch(self, key, tups):
            # emit the count of words we had per 5s batch
            self.emit([key, len(tups)])

Implements 5-second micro-batches.
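What happens to one batch window can be sketched in plain Python, without streamparse. This is an illustrative assumption about the grouping step only (the function name batch_counts is made up); it does not model the 5-second timer or how BatchingBolt is implemented.

```python
from collections import defaultdict


def batch_counts(tups, group_key):
    """Group one window's tuples by key and return (key, count) pairs,
    mirroring what group_key and process_batch do together."""
    batches = defaultdict(list)
    for tup in tups:
        batches[group_key(tup)].append(tup)
    return [(key, len(group)) for key, group in batches.items()]


# Tuples that arrived during one batch window.
window = [("dog",), ("cat",), ("dog",)]
print(batch_counts(window, lambda tup: tup[0]))  # [('dog', 2), ('cat', 1)]
```

Emitting one tuple per key per window, instead of one per input tuple, is what reduces downstream traffic.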

Slide 56

Slide 56 text

streamparse config.json

    {
        "envs": {
            "0.8": {
                "user": "ubuntu",
                "nimbus": "",
                "workers": ["", ""],
                "log_path": "/var/log/ubuntu/storm",
                "virtualenv_root": "/data/virtualenvs"
            },
            "vagrant": {
                "user": "ubuntu",
                "nimbus": "vagrant.local",
                "workers": ["vagrant.local"],
                "log_path": "/home/ubuntu/storm/logs",
                "virtualenv_root": "/home/ubuntu/virtualenvs"
            }
        }
    }
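As a small illustration of the config layout, the sketch below reads one named environment and checks for the keys shown on the slide. load_env and REQUIRED_KEYS are hypothetical helpers written for this example, not part of streamparse, and which keys streamparse itself requires is not something the slide states.

```python
import json

# Keys that appear in each environment on the slide (assumed, not verified,
# to be what a deployment needs).
REQUIRED_KEYS = ("user", "nimbus", "workers", "log_path", "virtualenv_root")


def load_env(config_text, env_name):
    """Return one environment's settings from a config.json-style document."""
    envs = json.loads(config_text)["envs"]
    if env_name not in envs:
        raise KeyError("unknown environment: %s" % env_name)
    env = envs[env_name]
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise ValueError("missing keys: %s" % ", ".join(missing))
    return env


sample = """
{"envs": {"vagrant": {
    "user": "ubuntu",
    "nimbus": "vagrant.local",
    "workers": ["vagrant.local"],
    "log_path": "/home/ubuntu/storm/logs",
    "virtualenv_root": "/home/ubuntu/virtualenvs"
}}}
"""
print(load_env(sample, "vagrant")["nimbus"])  # vagrant.local
```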

Slide 57

Slide 57 text

sparse options

    $ sparse help
    Usage:
        sparse quickstart
        sparse run [-o ]... [-p ] [-t

Slide 58

Slide 58 text

pykafka preview

Slide 59

Slide 59 text

Apache Kafka "Messaging rethought as a commit log." Distributed tail -f. Perfect fit for Storm Spouts. Able to keep up with Storm's high-throughput processing. Great for handling backpressure during traffic spikes.
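The "commit log" and "distributed tail -f" ideas can be sketched with a toy in-memory log. This is a conceptual illustration with made-up class names, not Kafka's or pykafka's API: messages are only ever appended, and each consumer keeps its own offset.

```python
class CommitLog:
    """Append-only list of messages addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads forward from its own offset, like tail -f on the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = CommitLog()
for word in ["dog", "cat", "zebra"]:
    log.append(word)

fast, slow = Consumer(log), Consumer(log)
print(fast.poll())                 # ['dog', 'cat', 'zebra']
print(slow.poll(max_messages=1))   # ['dog']
```

A slow consumer simply lags behind at a lower offset instead of blocking the producer, which is the property that helps with backpressure during traffic spikes.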

Slide 60

Slide 60 text

pykafka

We have released pykafka. NOT to be confused with kafka-python.

Upgraded internal Kafka 0.7 driver to 0.8.2:

    SimpleConsumer and BalancedConsumer
    Consumer Groups with Zookeeper
    Pure Python protocol implementation
    C protocol implementation in works (via librdkafka)

Slide 61

Slide 61 text

Questions? I'm sprinting on a Python Storm Topology DSL. Hacking on Monday and Tuesday. Join me! streamparse:'s hiring: Find me on Twitter: That's it!

Slide 62

Slide 62 text

Appendix

Slide 63

Slide 63 text

Storm and Spark Together

Slide 64

Slide 64 text

Overall Architecture

Slide 65

Slide 65 text

Multi-Lang Impl's in Python

    (Storm, 2010)
    Petrel (AirSage, Dec 2012)
    streamparse (, Apr 2014)
    pyleus (Yelp, Oct 2014)

Plans to unify IPC implementations around pystorm.

Slide 66

Slide 66 text

Other Related Projects

    lein - Clojure dependency manager used by streamparse
    flux - YAML Topology runner
    Clojure DSL - Topology DSL, bundled with Storm
    Trident - Java "high-level" DSL, bundled with Storm

streamparse uses lein and a simplified Clojure DSL.

Will add a Python DSL in 2.x.

Slide 67

Slide 67 text

Topology Wiring

    def wire(spout, bolts=()):
        """Wire the components together in a pipeline.
        Return the spout coroutine that kicks it off."""
        last = None
        # walk the bolts back to front so each one targets its successor;
        # the final bolt gets target=None
        for bolt in reversed(bolts):
            last = bolt_coroutine(bolt, target=last)
        return spout_coroutine(spout, target=last)

Slide 68

Slide 68 text

Streams, Grouping, Parallelism (still pseudocode)

    class WordCount(Topology):
        spouts = [
            WordSpout(
                name="word-spout",
                out=["word"],
                p=2)
        ]
        bolts = [
            WordCountBolt(
                name="word-count-bolt",
                from_=WordSpout,  # "from" is a Python keyword, hence from_
                group_on="word",
                out=["word", "count"],
                p=8)
        ]

Slide 69

Slide 69 text

Storm is "Javanonic" Ironic term one of my engineers came up with for a project that feels very Java-like, and not very "Pythonic".

Slide 70

Slide 70 text

Storm Java Quirks

    Topology Java builder API (eek).
    Projects built with Maven tasks (yuck).
    Deployment needs a JAR of your code (ugh).
    No simple local dev workflow built-in (boo).
    Storm uses Thrift interfaces (shrug).

Slide 71

Slide 71 text

Multi-Lang Protocol

The multi-lang protocol has the full core:

    ack
    fail
    emit
    anchor
    log
    heartbeat
    tuple tree

Slide 72

Slide 72 text

Kafka and Multi-consumer

Slide 73

Slide 73 text

Kafka Consumer Groups

Slide 74

Slide 74 text

Bolts for Real-Time ETL

Slide 75

Slide 75 text

streamparse projects