Andrew Montalenti - streamparse: real-time streams with Python and Apache Storm

Real-time streams are everywhere, but does Python have a good way of processing them? Until recently, there were no good options. A new open source project, streamparse, makes working with real-time data streams easy for Pythonistas. If you have ever wondered how to process 10,000 data tuples per second with Python -- while maintaining high availability and low latency -- this talk is for you.


PyCon 2015

April 18, 2015


  1. streamparse
    Defeat the Python GIL with Apache Storm.
    Andrew Montalenti, CTO
  2. About Me
    CTO/co-founder of
    Hacking in Python for over a decade
    Fully distributed team
    @amontalenti on Twitter
  3. Python GIL
    Python's GIL does not allow true multi-thread parallelism.
    And on multi-core, it even leads to lock contention.
    @dabeaz discussed this in a Friday talk on concurrency.
  4. Queues and workers
    Standard way to solve GIL woes.
    Queues: ZeroMQ => Redis => RabbitMQ
    Workers: Cron Jobs => RQ => Celery
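
A minimal sketch of the queues-and-workers pattern from slide 4, using only the standard library. It is not code from the talk: `worker` and `run` are hypothetical names, and threads stand in for the separate worker processes (RQ, Celery) that a real deployment would use to get around the GIL.

```python
import queue
import threading


def worker(tasks, results):
    """Pull lines off the task queue until a None sentinel arrives."""
    while True:
        line = tasks.get()
        if line is None:
            break
        # Stand-in for real work: count the words in the line.
        results.put((line, len(line.split())))


def run(lines, n_workers=2):
    """Fan lines out to n_workers and collect {line: word_count}."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for line in lines:
        tasks.put(line)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    return dict(results.get() for _ in lines)


print(run(["the quick brown fox", "jumps over"]))
```

Every new processing stage in this style needs its own queue and its own pool of workers, which is the wiring that later slides hand over to Storm.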
  5. Architecture, 2012
  6. It started to get messy
  7. As Hettinger Says...
    "There must be a better way..."
  8. What is this Storm thing?
    We read: "Storm is a distributed real-time computation system."
    Dramatically simplifies your workers and queues.
    "Great," we thought. "But, what about Python support?"
    That's what streamparse is about.
  9. Our Storm Use Case
  10. What is
    Web content analytics for digital storytellers.
    Some of our customers:
  11. Elegant data dashboards
    Informing thousands of editors and writers every day.
  12. Powerful data APIs
    Powering billions of site visits every month.
  13. Too many datas!
  14. "Python Can't Do This"
    "Free lunch is over."
    "It can't scale."
    "It's a toy language."
    "Shoulda used Scala."
  15. Python Can't Scale?
    Eat that, haters!
  16. Thanks to Storm
  17. streamparse is Pythonic Storm
    streamparse lets you parse real-time streams of data.
    It smoothly integrates Python code with Apache Storm.
    Easy quickstart, good CLI/tooling, production tested.
    Good for: Analytics, Logs, Sensors, Low-Latency Stuff.
  18. Agenda
    Storm topology concepts
    Storm internals
    How does Python work with Storm?
    streamparse overview
    pykafka preview
    Slides on Twitter; follow @amontalenti.
  19. Storm Topology Concepts
  20. Storm Abstractions
    Storm provides abstractions for data processing:
    Tuple
    Spout
    Bolt
    Topology
  21. Wired Topology
  22. WARNING
    Using Python pseudocode and coroutines!
  23. Tuple
    A single data record that flows through your cluster.
        # tuple spec: ["word"]
        word = ("dog",)
        # tuple spec: ["word", "count"]
        word_count = ("dog", 4)
  24. Spout
    A component that emits raw data into the cluster.
        class Spout(object):
            def next_tuple(self):
                """Called repeatedly to emit tuples."""

        @coroutine
        def spout_coroutine(spout, target):
            """Get tuple from spout and send it to target."""
            while True:
                tup = spout.next_tuple()
                if tup is None:
                    time.sleep(10)
                    continue
                if target is not None:
                    target.send(tup)
  25. Bolt
    A component that implements one processing stage.
        class Bolt(object):
            def process(self, tup):
                """Called repeatedly to process tuples."""

        @coroutine
        def bolt_coroutine(bolt, target):
            """Get tuple from input, process it in Bolt.
            Then send it to next bolt target, if it exists."""
            while True:
                tup = (yield)
                if tup is None:
                    time.sleep(10)
                    continue
                to_emit = bolt.process(tup)
                if target is not None:
                    target.send(to_emit)
  26. Topology
    Directed Acyclic Graph (DAG) describing it all.
        # lay out topology
        spout = WordSpout
        bolts = [WordCountBolt, DebugPrintBolt]

        # wire topology
        topology = wire(spout=spout, bolts=bolts)

        # start the topology
        next(topology)
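
The pseudocode on slides 24 to 26 can be turned into something runnable with a few assumptions. The sketch below is not the talk's code: `CollectBolt`, `run_spout`, and the finite word list are additions so the example terminates, and the spout is driven by a plain loop rather than a coroutine.

```python
from collections import Counter


def coroutine(func):
    """Prime a generator so it is immediately ready to receive send()."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


class WordSpout:
    """Emits one-word tuples from a finite list (the talk's spout is endless)."""
    def __init__(self, words):
        self.words = iter(words)

    def next_tuple(self):
        word = next(self.words, None)
        return None if word is None else (word,)


class WordCountBolt:
    """Keeps running counts and emits (word, count)."""
    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        word = tup[0]
        self.counts[word] += 1
        return (word, self.counts[word])


class CollectBolt:
    """Hypothetical sink bolt that records every tuple it sees."""
    def __init__(self):
        self.seen = []

    def process(self, tup):
        self.seen.append(tup)
        return tup


@coroutine
def bolt_coroutine(bolt, target=None):
    """Receive a tuple, process it, and pass the result downstream."""
    while True:
        tup = (yield)
        out = bolt.process(tup)
        if target is not None:
            target.send(out)


def wire(bolts):
    """Chain bolts back to front; return the first bolt's coroutine."""
    target = None
    for bolt in reversed(bolts):
        target = bolt_coroutine(bolt, target)
    return target


def run_spout(spout, target):
    """Push tuples into the pipeline until the spout runs dry."""
    while True:
        tup = spout.next_tuple()
        if tup is None:
            break
        target.send(tup)


collector = CollectBolt()
run_spout(WordSpout(["dog", "cat", "dog"]), wire([WordCountBolt(), collector]))
print(collector.seen)  # [('dog', 1), ('cat', 1), ('dog', 2)]
```

Everything here runs in one process and one thread; the point of Storm is that each spout and bolt instance can instead run as its own process on any node.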
  27. Storm Internals
  28. Tuple Tree
  29. Streams, Grouping and Parallelism
        X            word-spout        word-count-bolt
        input        None              word-spout
        output       word-count-bolt   None
        tuple        ("dog",)          ("dog", 4)
        stream       ["word"]          ["word", "count"]
        grouping     ["word"]          ":shuffle"
        parallelism  2                 8
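
To make the grouping row concrete: a fields grouping on `["word"]` means every tuple with the same word lands on the same task, while a shuffle grouping spreads tuples at random. The sketch below illustrates that idea only. The function names and the use of CRC32 are assumptions chosen for a stable hash; this is not Storm's actual routing code.

```python
import random
import zlib


def fields_grouping(tup, field_indexes, num_tasks):
    """Pick a task index from a stable hash of the grouping fields."""
    key = "\x00".join(str(tup[i]) for i in field_indexes)
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks, rng=random):
    """Pick a task index at random, ignoring the tuple's contents."""
    return rng.randrange(num_tasks)


# With parallelism 8, "dog" always maps to the same one of 8 tasks.
first = fields_grouping(("dog",), [0], 8)
second = fields_grouping(("dog",), [0], 8)
print(first == second)  # True
```

This is why the word count bolt can keep its counts in memory: with a fields grouping on the word, no other task ever sees that word.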
  30. Nimbus and Storm UI
  31. Workers and Zookeeper
  32. Empty Slots
  33. Filled Slots and Rebalancing
  34. BTW, Buy This Book!
    Source of these diagrams.
    Storm Applied, by Manning Press.
    Reviewed in Storm, The Big Reference.
  35. Network Transfer
  36. So, Storm is Sorta Amazing!
    Storm...
    will guarantee processing via tuple trees
    does tuneable parallelism per component
    implements a high availability model
    allocates Python process slots on physical nodes
    helps us rebalance computation across cluster
    handles network messaging automatically
    And, it beats the GIL!
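
The "guarantee processing via tuple trees" point can be illustrated with the XOR trick that Storm's documentation describes for its acker: every tuple id is XORed into a checksum once when the tuple is emitted and once when it is acked, so the checksum returns to zero exactly when the whole tree has been acked. The class below is a simplified sketch of that idea with hypothetical names, not Storm's implementation.

```python
import random

_rng = random.Random(42)  # seeded only so the example is repeatable


class TupleTree:
    """Tracks one spout tuple's whole tree with a single XOR checksum."""

    def __init__(self):
        self.checksum = 0

    def emit(self):
        """Register a new tuple in the tree and return its random 64-bit id."""
        tuple_id = _rng.getrandbits(64)
        self.checksum ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        """XOR the id in a second time, cancelling the emit."""
        self.checksum ^= tuple_id

    @property
    def fully_processed(self):
        return self.checksum == 0


tree = TupleTree()
root = tree.emit()    # the spout emits ("dog",)
child = tree.emit()   # a bolt emits ("dog", 4) anchored to the root
tree.ack(root)
print(tree.fully_processed)  # False: the child is still pending
tree.ack(child)
print(tree.fully_processed)  # True: every tuple in the tree was acked
```

The appeal of the scheme is constant memory per spout tuple, however large the tree grows.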
  37. Let's Do This!
  38. Getting Python on Storm
  39. Multi-Lang Protocol (1)
    Storm supports Python through the multi-lang protocol.
    JSON protocol
    Works via shell-based components
    Communicate over STDIN and STDOUT
    Clean, UNIX-y.
    Can use CPython, PyPy; no need for Jython or Py4J.
    Kinda quirky, but also relatively simple to implement.
  40. Multi-Lang Protocol (2)
    Each component of a "Python" Storm topology is either:
    ShellSpout
    ShellBolt
    Java implementations speak to Python via light JSON.
    There's one sub-process per Storm task.
    If p = 8, then 8 Python processes are spawned.
  41. Multi-Lang Protocol (3)
        INIT: JVM => Python   >JSON
        XFER: JVM => JVM      >Kryo
        DATA: JVM => Python   >JSON
        EMIT: Python => JVM   >JSON
        XFER: JVM => JVM      >Kryo
        ACK:  Python => JVM   >JSON
        BEAT: JVM => Python   >JSON
        SYNC: Python => JVM   >JSON
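
For the JSON legs of the exchange above, Storm's multi-lang documentation frames each message as a JSON object followed by a line containing only `end`. The sketch below shows that framing over any text stream. `send_message` and `read_message` are hypothetical helper names, not streamparse or Storm APIs, and an in-memory buffer stands in for the STDIN/STDOUT pipes.

```python
import io
import json


def send_message(msg, stream):
    """Write one JSON message followed by the 'end' delimiter line."""
    stream.write(json.dumps(msg))
    stream.write("\nend\n")
    stream.flush()


def read_message(stream):
    """Read lines up to the next 'end' line and decode them as JSON."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            raise EOFError("stream closed before 'end' delimiter")
        if line.rstrip("\n") == "end":
            break
        lines.append(line)
    return json.loads("".join(lines))


# Round trip through an in-memory buffer instead of a real pipe.
pipe = io.StringIO()
send_message({"command": "emit", "tuple": ["dog", 4]}, pipe)
send_message({"command": "ack", "id": "123"}, pipe)
pipe.seek(0)
print(read_message(pipe))  # {'command': 'emit', 'tuple': ['dog', 4]}
print(read_message(pipe))  # {'command': 'ack', 'id': '123'}
```

In a real component the same functions would be pointed at `sys.stdin` and `sys.stdout`, which is why nothing else in the Python process may print to standard output.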
  42. issues
    Storm bundles "" (a multi-lang implementation).
    But, it's not Pythonic.
    We'll fix that, we thought!
  43. Storm as Infrastructure
    Thought: Storm should be like Cassandra/Elasticsearch.
    "Written in Java, but Pythonic nonetheless."
    Need: Python as a first-class citizen.
    Must also fix "Javanonic" bits (e.g. packaging).
  44. streamparse overview
  45. Enter streamparse
    Initial release Apr 2014; one year of active development.
    600+ stars on GitHub, was a trending repo in May 2014.
    90+ mailing list members and 5 new committers.
    3 engineers maintaining it.
    Funding from DARPA. (Yes, really!)
  46. streamparse CLI
    sparse provides a CLI front-end to streamparse, a framework for creating Python projects for running, debugging, and submitting Storm topologies for data processing.
    After installing lein (the only dependency), you can run:
        pip install streamparse
    This installs a command-line tool, sparse. Use:
        sparse quickstart
  47. Running and debugging
    You can then run the local Storm topology using:
        $ sparse run
        Running wordcount topology...
        Options: {:spec "topologies/wordcount.clj", ...}
        #<StormTopology StormTopology(spouts:{word-spout=...
        storm.daemon.nimbus - Starting Nimbus with conf {...
        storm.daemon.supervisor - Starting supervisor with id 4960ac74...
        storm.daemon.nimbus - Received topology submission with conf {...
        ... lots of output as topology runs...
    See a live demo on YouTube.
  48. Submitting to remote cluster
    Single command:
        $ sparse submit
    Does all the following magic:
    Makes virtualenvs across cluster
    Builds a JAR out of your source code
    Opens reverse tunnel to Nimbus
    Constructs an in-memory Topology spec
    Uploads JAR to Nimbus
  49. streamparse supplants
  50. Let's Make a Topology!
  51. Word Stream Spout (Storm DSL)
        {"word-spout" (python-spout-spec
            options
            "spouts.words.WordSpout" ; class (spout)
            ["word"]                 ; stream (fields)
            )
        }
  52. Word Stream Spout in Python
        import itertools
        from streamparse.spout import Spout

        # Named WordSpout to match "spouts.words.WordSpout" in the DSL spec.
        class WordSpout(Spout):

            def initialize(self, conf, ctx):
                self.words = itertools.cycle(['dog', 'cat',
                                              'zebra', 'elephant'])

            def next_tuple(self):
                word = next(self.words)
                self.emit([word])
    Emits one-word tuples from endless generator.
  53. Word Count Bolt (Storm DSL)
        {"word-count-bolt" (python-bolt-spec
            options
            {"word-spout" ["word"]}     ; input (grouping)
            "bolts.wordcount.WordCount" ; class (bolt)
            ["word" "count"]            ; stream (fields)
            :p 2                        ; parallelism
            )
        }
  54. Word Count Bolt in Python
        from collections import Counter
        from streamparse.bolt import Bolt

        class WordCount(Bolt):

            def initialize(self, conf, ctx):
                self.counts = Counter()

            def process(self, tup):
                word = tup.values[0]
                self.counts[word] += 1
                self.log('%s: %d' % (word, self.counts[word]))
    Keeps word counts in-memory (assumes grouping).
  55. BatchingBolt for Performance
        from streamparse.bolt import BatchingBolt

        class WordCount(BatchingBolt):

            secs_between_batches = 5

            def group_key(self, tup):
                # collect batches of words
                word = tup.values[0]
                return word

            def process_batch(self, key, tups):
                # emit the count of words we had per 5s batch
                self.emit([key, len(tups)])
    Implements 5-second micro-batches.
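
The effect of 5-second micro-batches can be shown without Storm at all. The sketch below groups timestamped words into fixed windows and counts each word per window, which mirrors what `group_key` and `process_batch` produce together. It is an illustration only: `micro_batch_counts` is a hypothetical helper, and it windows on event timestamps, which is an assumption made for testability rather than a description of how BatchingBolt schedules its batches.

```python
from collections import Counter


def micro_batch_counts(events, secs_between_batches=5):
    """Group (timestamp_secs, word) events into windows; count words per window.

    Returns a list of (window_start_secs, {word: count}) in time order.
    """
    windows = {}
    for ts, word in events:
        window = int(ts // secs_between_batches)
        windows.setdefault(window, Counter())[word] += 1
    return [(w * secs_between_batches, dict(counts))
            for w, counts in sorted(windows.items())]


events = [(0.5, "dog"), (1.0, "cat"), (4.9, "dog"), (5.0, "dog"), (9.0, "cat")]
print(micro_batch_counts(events))
# [(0, {'dog': 2, 'cat': 1}), (5, {'dog': 1, 'cat': 1})]
```

Five input tuples become four output tuples here; with real traffic the reduction is much larger, which is where the performance gain comes from.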
  56. streamparse config.json
        {
            "envs": {
                "0.8": {
                    "user": "ubuntu",
                    "nimbus": "",
                    "workers": ["", ""],
                    "log_path": "/var/log/ubuntu/storm",
                    "virtualenv_root": "/data/virtualenvs"
                },
                "vagrant": {
                    "user": "ubuntu",
                    "nimbus": "vagrant.local",
                    "workers": ["vagrant.local"],
                    "log_path": "/home/ubuntu/storm/logs",
                    "virtualenv_root": "/home/ubuntu/virtualenvs"
                }
            }
        }
  57. sparse options
        $ sparse help
        Usage:
            sparse quickstart <project_name>
            sparse run [-o <option>]... [-p <par>] [-t <time>] [-dv]
            sparse submit [-o <option>]... [-p <par>] [-e <env>] [-dvf]
            sparse list [-e <env>] [-v]
            sparse kill [-e <env>] [-v]
            sparse tail [-e <env>] [--pattern <regex>]
            sparse (-h | --help)
            sparse --version
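
The `-e <env>` option presumably names one of the blocks under `"envs"` in config.json. The sketch below shows that lookup with the standard library only. `get_env` is a hypothetical helper written for illustration, not part of streamparse, and the config text copies the "vagrant" block from slide 56.

```python
import json

CONFIG_TEXT = """
{
    "envs": {
        "vagrant": {
            "user": "ubuntu",
            "nimbus": "vagrant.local",
            "workers": ["vagrant.local"],
            "log_path": "/home/ubuntu/storm/logs",
            "virtualenv_root": "/home/ubuntu/virtualenvs"
        }
    }
}
"""


def get_env(config_text, env_name):
    """Return the settings dict for one named environment."""
    envs = json.loads(config_text)["envs"]
    if env_name not in envs:
        raise KeyError("unknown environment: %s" % env_name)
    return envs[env_name]


env = get_env(CONFIG_TEXT, "vagrant")
print(env["nimbus"], env["workers"])  # vagrant.local ['vagrant.local']
```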
  58. pykafka preview
  59. Apache Kafka
    "Messaging rethought as a commit log."
    Distributed tail -f.
    Perfect fit for Storm Spouts.
    Able to keep up with Storm's high-throughput processing.
    Great for handling backpressure during traffic spikes.
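
The "commit log" and "distributed tail -f" descriptions come down to an append-only sequence that each consumer reads at its own offset. The toy model below shows only that idea, in memory and in one process. `CommitLog` and `Consumer` are hypothetical names and have nothing to do with the pykafka or Kafka APIs.

```python
class CommitLog:
    """Append-only list of records; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads forward from its own offset; never removes records from the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for word in ["dog", "cat", "zebra"]:
    log.append(word)

fast, slow = Consumer(log), Consumer(log)
print(fast.poll())   # ['dog', 'cat', 'zebra']
print(slow.poll(1))  # ['dog']
print(slow.poll())   # ['cat', 'zebra']
```

Because a slow consumer simply falls behind in offset instead of blocking the producer, the log absorbs traffic spikes, which is the backpressure point on the slide.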
  60. pykafka
    We have released pykafka.
    NOT to be confused with kafka-python.
    Upgraded internal Kafka 0.7 driver to 0.8.2:
    SimpleConsumer and BalancedConsumer
    Consumer Groups with Zookeeper
    Pure Python protocol implementation
    C protocol implementation in the works (via librdkafka)
  61. Questions?
    I'm sprinting on a Python Storm Topology DSL.
    Hacking on Monday and Tuesday. Join me!
    streamparse:'s hiring:
    Find me on Twitter: @amontalenti
    That's it!
  62. Appendix
  63. Storm and Spark Together
  64. Overall Architecture
  65. Multi-Lang Impl's in Python
    (Storm, 2010)
    Petrel (AirSage, Dec 2012)
    streamparse (, Apr 2014)
    pyleus (Yelp, Oct 2014)
    Plans to unify IPC implementations around pystorm.
  66. Other Related Projects
    lein - Clojure dependency manager used by streamparse
    flux - YAML Topology runner
    Clojure DSL - Topology DSL, bundled with Storm
    Trident - Java "high-level" DSL, bundled with Storm
    streamparse uses lein and a simplified Clojure DSL.
    Will add a Python DSL in 2.x.
  67. Topology Wiring
        def wire(spout, bolts=()):
            """Wire the components together in a pipeline.
            Return the spout coroutine that kicks it off."""
            target = None
            # Walk the bolts back to front so each one knows its downstream target.
            for bolt in reversed(bolts):
                target = bolt_coroutine(bolt, target=target)
            return spout_coroutine(spout, target=target)
  68. Streams, Grouping, Parallelism (still pseudocode)
        class WordCount(Topology):
            spouts = [
                WordSpout(
                    name="word-spout",
                    out=["word"],
                    p=2)
            ]
            bolts = [
                WordCountBolt(
                    name="word-count-bolt",
                    from=WordSpout,
                    group_on="word",
                    out=["word", "count"],
                    p=8)
            ]
  69. Storm is "Javanonic"
    Ironic term one of my engineers came up with for a project that feels very Java-like, and not very "Pythonic".
  70. Storm Java Quirks
    Topology Java builder API (eek).
    Projects built with Maven tasks (yuck).
    Deployment needs a JAR of your code (ugh).
    No simple local dev workflow built-in (boo).
    Storm uses Thrift interfaces (shrug).
  71. Multi-Lang Protocol
    The multi-lang protocol has the full core:
    ack
    fail
    emit
    anchor
    log
    heartbeat
    tuple tree
  72. Kafka and Multi-consumer
  73. Kafka Consumer Groups
  74. Bolts for Real-Time ETL
  75. streamparse projects