Modern Python Concurrency

What is concurrency? Is Python good at it? How do we scale from single-node to multi-node concurrency? What does Python's new concurrent.futures stdlib module give us? What does asyncio give us? How do we evaluate concurrent performance?


Andrew Montalenti

September 03, 2014


  1. Modern Python concurrency: toward futures and asyncio

    Andrew Montalenti and Doug Turnbull, Python Charlottesville Meetup, September 2, 2014
  2. @softwaredoug @amontalenti

  3. Doug’s part

  4. What's Concurrency? • Our system's ability to do more than one thing at once,

    i.e. simultaneously handling web requests, database transactions, requests to drives,
    requests to databases, requests to web services, and user input
  5. • run your app with less • simplify • save green: $

    From 100 web requests per second per server... to 10,000 web requests per second per
    server. Getting it right is a Big Deal(R)
  6. Concurrency -- How? • Nature of the work? • My

    OS just does this, right? • Is Python good at this?
  7. Nature of the work? CPU Bound -- “calculating digits of

    pi” IO Bound Work -- “take request, query db, send response”
  8. What stops us from doing work? • contention -- fighting

    over a resource (like a lock) • blocking -- stopping execution to wait (stops me, lets everyone else go)
  9. My OS just does this, right? • processes: sandboxed memory space

    • threads: run within a process, share memory with other threads • Do these get me
    where I need to be?
  10. Is Python good at this? • It's half good: GIL!!!!

    Who is GIL? The Global Interpreter Lock • Only one thread runs in the Python interpreter
    at once • Threads tend to keep the GIL until they're done or they do IO. But... • This
    slows us down. Contention! • To the codez!
  11. Python CPU Bound Threads

    from queue import Queue
    from threading import Thread

    inQ = Queue()
    outQ = Queue()

    def worker():
        while True:
            l = inQ.get()          # get work to do
            sumL = sum(l)          # CPU-bound work
            outQ.put(sumL)         # work output

    numWorkers = 10
    ts = [Thread(target=worker) for i in xrange(numWorkers)]   # create threads
    for t in ts:
        t.start()                  # main thread carries on
  12. Python IO Worker Threads

    ...
    inQ = Queue()
    outQ = Queue()

    def worker():
        while True:
            url = inQ.get()                                 # get work to do
            resp = requests.get(url)                        # blocking IO
            outQ.put((url, resp.status_code, resp.text))    # work output

    numWorkers = 10
    ts = [Thread(target=worker) for i in xrange(numWorkers)]
    ...
  13. CPU Bound Threads... we like? (same Queue/Thread worker code as slide 11)

    ☹ CPU-bound work doesn't release the GIL so... ☹ ...no gain from more threads/cores
    ☹ ...only more contention ☹ Contention points? ☺ Code straightforward
  14. IO Worker Threads... we like? (same Queue/Thread worker code as slide 12)

    ☺ Code straightforward ☺ Gives up GIL ☹ One blocking IO operation per thread so...
    ☹ ...how many to start? a pool? ☹ Contention points?
  15. Improve upon CPU bound? Using Processes (and an interprocess Queue) -- same worker code

    from multiprocessing import Process, Queue

    def worker(inQ, outQ):
        while True:
            l = inQ.get()
            sumL = sum(l)
            outQ.put(sumL)

    inQ = Queue()
    outQ = Queue()
    p = Process(target=worker, args=(inQ, outQ))
    p.start()
  16. CPU Bound Processes... we like? (same worker code, now run in a pool of Processes)

    numWorkers = 10
    ts = [Process(target=worker, args=(inQ, outQ)) for i in xrange(numWorkers)]
    ...

    ☺ Not sharing the GIL ☺ No GIL: more processes means concurrent work ☺ Max out all
    your cores! ☺ Code straightforward ☹ Contention? ☺ (less scary -- process abstractions)
    We LIKE!
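
    For reference, a fuller sketch of this multiprocessing pattern: it also feeds the input
    queue, uses a sentinel to shut workers down, and collects results. Chunk sizes and the
    worker count are illustrative, not from the slides.

    from multiprocessing import Process, Queue

    def worker(inQ, outQ):
        while True:
            l = inQ.get()
            if l is None:               # sentinel: no more work
                break
            outQ.put(sum(l))            # CPU-bound work runs in a separate process

    if __name__ == '__main__':
        inQ, outQ = Queue(), Queue()
        numWorkers = 4
        procs = [Process(target=worker, args=(inQ, outQ)) for _ in range(numWorkers)]
        for p in procs:
            p.start()

        chunks = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]
        for chunk in chunks:
            inQ.put(chunk)
        for _ in procs:                 # one sentinel per worker
            inQ.put(None)

        results = [outQ.get() for _ in chunks]
        for p in procs:
            p.join()
        print(sum(results))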
  17. Processes Rule, Threads Drool

                                  Threads                                       Processes
    Light?                        Yes ☺                                         Almost as light ☺
    Danger?                       High -- mutable, shared state, deadlocks ☹    Lower, stricter communication ☺
    Communication primitives?     Mutexes/locks, atomic CPU instructions,       OS abstractions, pipes, sockets,
                                  thread-safe data structures ☺                 shared memory, etc ☺
    If they crash...              whole program crashes ☹                       only that process crashes ☺
    Control?                      Extremely high ☺                              Moderate, through abstractions
    GIL?                          Yes ☹                                         No ☺
  18. Processes Rule, Threads Drool • Processes: safer choice, sandboxed, coarse-grain

    control • Threads: dangerous choice, very fine-grain control/interaction
  19. Improve on IO bound work? IO-bound work often looks like:

    def worker():
        while True:
            handle_ui_input()
            handle_io()

    def worker1():
        while True:
            resp = handle_io1()            # blocking
            update_shared_state(resp)      # contention -- need to lock and update

    def worker2():
        while True:
            resp = handle_io2()            # blocking
            update_shared_state(resp)      # contention -- need to lock and update
  20. Other OS primitives help?

    def event_loop():
        while True:
            whichIsReady = select(ui, io1, io2)   # simultaneously block on multiple IO operations
            if whichIsReady == io1:
                resp = handle_io1(req)
            if whichIsReady == ui:
                ...

    No longer need to lock shared state (everything runs in one thread)
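
    The same idea in concrete form, sketched with the stdlib selectors module (Python 3.4+).
    The echo-style read handler and the port are illustrative.

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(sock):
        conn, _ = sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, read)    # the data slot holds the callback

    def read(conn):
        data = conn.recv(4096)
        if data:
            conn.send(data)             # "handle_io": echo the bytes back
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(('localhost', 8000))
    server.listen(100)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    def event_loop():
        while True:
            for key, _ in sel.select():     # block on many sockets at once
                callback = key.data         # accept() or read()
                callback(key.fileobj)

    event_loop()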
  21. Non-blocking IO... we like?

    def event_loop():
        schedIo = [...]
        while True:
            whichIsReady = select(schedIo)
            whichIsReady.callback()

    httpReq = HttpReq('')
    def callWhenDone():
        print ...
    httpReq.fetch(callback=callWhenDone)
    ...

    ☹ harder to read (Javascript anyone?) ☹ extra "IO" scheduler / event loop
    ☺ concurrent IO not bound to number of threads
  22. Non-blocking IO improvements: Promise / Deferred / Futures -- a handle to future work

    httpReq = HttpReq('')

    def callWhenDone():
        print "Fetched and stored imgs!"

    def storeInDb(httpPromise):
        dbPromise = ...
        return dbPromise

    promise = httpReq.fetch()                 # handle to pending IO
    imgPromise = parseImgTags(promise)        # chainable with other operations
    dbPromise = storeInDb(imgPromise)
    dbPromise.onDone(callback=callWhenDone)   # can still use callbacks
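
    A rough translation of this promise-style chaining into stdlib terms with
    concurrent.futures (covered later in the deck). The URL and the parsing step are
    illustrative, and the chaining is deliberately crude -- a futures.Future has no
    built-in .then(), so the second stage simply waits on the first inside a worker thread.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    executor = ThreadPoolExecutor(max_workers=4)

    def fetch(url):
        return urllib.request.urlopen(url).read()

    def parse_img_tags(html):
        return [line for line in html.split(b'\n') if b'<img' in line]

    def call_when_done(future):
        # fires when the chained work completes; .result() re-raises worker errors
        print('parsed %d img tags' % len(future.result()))

    html_future = executor.submit(fetch, 'http://example.com/')                  # handle to pending IO
    img_future = executor.submit(lambda: parse_img_tags(html_future.result()))   # chained stage
    img_future.add_done_callback(call_when_done)                                 # callbacks still work
    img_future.result()                                                          # wait for the whole chain
    executor.shutdown()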
  23. Non-blocking IO improvements: coroutines / cooperative multitasking
    ("I own this thread until I say I'm done")

    httpReq = HttpReq('')

    def storeInDb(httpPromise):
        dbPromise = ...
        return dbPromise

    promise = httpReq.fetch()
    promise.whenComplete(callWhenDone)
    imgPromise = parseImgTags(promise)
    dbPromise = storeInDb(imgPromise)
    myCoroutine.yieldUntil(dbPromise)   # looks like a blocking call, but really yields back to the event loop
    print "Fetched and stored IMG tags!"

    ☺ Readability of blocking IO ☺ Performance of non-blocking async IO
  24. Example: Greenlet/Gevent Greenlet: • coroutine library, greenlet decides when to

    give up control Gevent: • monkey-patch a big event loop into Python, replacing core blocking IO bits
  25. Gevent is magic

    class Fetcher(object):
        def __init__(self, fetchUrl):
            self.fetchUrl = fetchUrl
            self.resp = None

        def fetch(self):
            # blocking? depends on your definition of "blocking"
            self.resp = requests.get(self.fetchUrl)

    def fetchMultiple(urls):
        fetchers = [Fetcher(url) for url in urls]
        handles = []
        for fetcher in fetchers:
            handles.append(gevent.spawn(fetcher.fetch))   # spawn a gevent worker that calls "fetch"
        gevent.joinall(handles)                           # wait till all done
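
    One caveat the slide glosses over: requests.get only yields to other greenlets if gevent
    has monkey-patched the socket layer first. A minimal self-contained sketch (URLs are
    illustrative):

    from gevent import monkey
    monkey.patch_all()          # patch sockets early, before requests is imported

    import gevent
    import requests

    def fetch(url):
        # looks blocking, but the patched socket yields to the gevent hub
        return requests.get(url).status_code

    urls = ['http://example.com/', 'http://example.org/']
    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs)
    print([job.value for job in jobs])   # each greenlet's return value, e.g. [200, 200]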
  26. Other Solutions (aka future lightning talk fodder): Twisted, CPython C

    modules (scipy, your module), Cython (compiles Python -> C), Jython / IronPython (JIT on
    the JVM or .NET CLI), GPUs (CUDA, etc), cluster frameworks (discussed later)
  27. Andrew’s part

  28. Python GIL • why it's there • what it messes up, concurrency-wise

    • failed efforts to remove the GIL • pypy and pypy-stm
  29. Python concurrency menu

                      CPU-bound work                       IO-bound work
    single-node       multiprocessing,                     Twisted (Network), Tornado (HTTP),
                      ProcessPoolExecutor                  ThreadPoolExecutor (Generic),
                                                           gevent (Sockets), asyncore (Sockets),
                                                           asyncio (Generic)
    multi-node        rq, celery, IPython.parallel         scrapyd? (HTTP)
  30. I love processes. (you should, too)

  31. Options for multi-node concurrency

    start here: Redis as message broker, pure Python tasks, pure Python worker
    infrastructure, simple message patterns
    upgrade here: choose your own message broker, pure Python tasks, pure Python worker
    infrastructure, advanced message patterns
    beyond that: choose your own message broker, mix Python + Java or other languages,
    Java worker infrastructure, advanced message patterns, more complex operationally,
    high availability & linear scalability, "Lambda Architecture"
  32. Python concurrency in clusters • mrjob: Hadoop Streaming (batch) •

    streamparse: Apache Storm (real-time) • parallelize through Python process model • mixed workloads ◦ CPU- and IO-bound • mixed concurrency models are possible ◦ threads within Storm Bolts ◦ process pools within Hadoop Tasks
  33. My instinct: threads = bugs

  34. (image-only slide; no text extracted)
  35. But, sometimes necessary - BatchingBolt - IO-bound Drivers

  36. What is asyncore? • stdlib-included async sockets (like libev) •

    in stdlib since 2000! Comment from the source code in 2000: There are only two ways to
    have a program on a single processor do "more than one thing at a time". Multi-threaded
    programming is the simplest and most popular way to do it, but there is another very
    different technique, that lets you have nearly all the advantages of multi-threading,
    without actually using multiple threads. It's really only practical if your program is
    largely I/O bound. If your program is CPU bound, then pre-emptive scheduled threads are
    probably what you really need. Network servers are rarely CPU-bound, however.
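
    A minimal asyncore client, adapted from the pattern in the stdlib documentation, to show
    what "async sockets" look like: one loop multiplexes every dispatcher. Hosts and the
    request path are illustrative.

    import asyncore
    import socket

    class HTTPClient(asyncore.dispatcher):
        def __init__(self, host, path):
            asyncore.dispatcher.__init__(self)
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.buffer = ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode('ascii')

        def handle_connect(self):
            pass

        def handle_close(self):
            self.close()

        def handle_read(self):
            print(self.recv(8192))      # response bytes arrive as the socket becomes readable

        def writable(self):
            return len(self.buffer) > 0

        def handle_write(self):
            sent = self.send(self.buffer)
            self.buffer = self.buffer[sent:]

    clients = [HTTPClient('www.python.org', '/'), HTTPClient('www.example.com', '/')]
    asyncore.loop()                     # one event loop drives both requests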
  37. Python async networking (compared to nginx / Node.JS): Twisted, Tornado,

    gevent / gthreads -- all using their own "reactor" / "event loop"
  38. What is concurrent.futures? • PEP-3148 • new unified API for

    concurrency in Python • in stdlib in Python 3.2+ • backport in 2.7 (pip install futures) • API design like Java Executor Framework • a Future abstraction as a return value • an Executor abstraction for running things
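
    A small sketch of the Executor/Future API (file names are illustrative); swapping
    ThreadPoolExecutor for ProcessPoolExecutor parallelizes CPU-bound work with the same code.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def count_words(path):
        with open(path) as f:
            return path, len(f.read().split())

    paths = ['a.txt', 'b.txt', 'c.txt']

    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(count_words, p) for p in paths]   # each submit returns a Future
        for future in as_completed(futures):
            path, words = future.result()       # re-raises any exception from the worker
            print('%s: %d words' % (path, words))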
  39. asyncio history • PEP-3153: Async IO Support • PEP-3156: Async

    IO Support “Rebooted” • GvR’s pet project from 2012-2014 • Original implementation called tulip • Released in Python 3.4 as asyncio • PyCon 2013 keynote by GvR focused on it • PEP-380 (yield from) utilized by it
  40. asyncio primitives • a loop that starts and stops •

    callback scheduling ◦ now ◦ at a time in the future ◦ repeated / periodic • associate
    callbacks with file I/O states • a pluggable I/O multiplexing mechanism ◦ select()
    ◦ poll(), epoll(), others
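
    A minimal sketch of those loop primitives in Python 3.4 (delays are illustrative;
    periodic work is just a callback that reschedules itself with call_later):

    import asyncio

    loop = asyncio.get_event_loop()

    def hello():
        print('callback running now')

    def later():
        print('callback running ~2 seconds later')
        loop.stop()                    # a loop that starts and stops

    loop.call_soon(hello)              # schedule a callback "now"
    loop.call_later(2, later)          # schedule a callback at a time in the future
    loop.run_forever()                 # run until loop.stop() is called
    loop.close()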
  41. asyncio "ooooh ahhhh" moments • introduces the @coroutine decorator • uses

    yield from to simplify callback hell • one event loop to rule them all ◦ Twisted and
    Tornado and gevent in the same app! • offers an asyncio.Future ◦ asyncio.Future quacks
    like futures.Future ◦ asyncio.wrap_future is an adapter ◦ asyncio.Task is a subclass
    of Future
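
    A minimal Python 3.4 sketch of @coroutine plus yield from; the sleep stands in for real IO.

    import asyncio

    @asyncio.coroutine
    def compute(x, y):
        yield from asyncio.sleep(1.0)      # pretend this is slow IO; yields to the event loop
        return x + y                       # the result flows back through "yield from"

    @asyncio.coroutine
    def show():
        result = yield from compute(1, 2)  # reads like blocking code -- no callback hell
        print('1 + 2 = %d' % result)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(show())
    loop.close()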
  42. @dabeaz on generators

  43. asyncio coroutines

    code                             explanation
    result = yield from future       suspend until the future is done, then return its result
    result = yield from coroutine    suspend until the coroutine returns a result
    return expression                return a result to another coroutine
    raise exception                  raise an exception to another coroutine
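
    The first row in action: a coroutine suspends on a plain asyncio.Future that a scheduled
    callback completes a moment later (the value and delay are illustrative):

    import asyncio

    @asyncio.coroutine
    def waiter(future):
        result = yield from future     # suspend until the future is done, then take its result
        print('got %r' % result)       # an exception set on the future would be raised here instead

    loop = asyncio.get_event_loop()
    fut = asyncio.Future()
    loop.call_later(0.1, fut.set_result, 'hello')   # complete the future from the event loop
    loop.run_until_complete(waiter(fut))
    loop.close()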
  44. @dabeaz on Future and Task unity: asyncio with an event loop and scheduler is

    analogous to threading with a thread pool and executor
  45. What is the gain of “yield from”? (1) So, why

    don’t I like green threads? In a simple program using stackless or gevent, it’s easy enough to say, ‘This is a call that goes to the scheduler -- it uses read() or send() or something. I know that’s a blocking call, I’ll be careful…. I don’t need explicit locking because between points A or B, I just need to make sure I don’t make any other calls to the scheduler.’ However, as code gets longer, it becomes hard to keep track. Sooner or later… - Guido van Rossum
  46. What is the gain of “yield from”? (2) Just trust

    me; this problem happened to me at a young and impressionable age, and I’ve never forgotten it. - Guido van Rossum
  47. The Future() is now() - Tulip project update (Jan 2014)

    by @guidovanrossum - Unyielding (Feb 2014) by @glyph - Generators, The Final Frontier (Apr 2014) by @dabeaz
  48. Concurrency in the real world • Cassandra driver shows different

    models • pip install cassandra-driver • asyncore and libev event loops preferred • twisted and gevent also provided • performance benchmarking is dramatic (ranges from 2k to 18k write ops per sec)
  49. yield from CassandraDemo() • spin up a Cassandra node •

    execute / benchmark naive sync writes • switch to batched futures • switch to callback chaining • try different event loops • switch to pypy • discuss how asyncio could clean this up
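
    A rough sketch of the first two steps of that demo with the DataStax cassandra-driver,
    assuming a local node and a hypothetical demo.events table; the schema, keyspace, and
    row counts are illustrative.

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])                 # local demo node
    session = cluster.connect('demo')                # assumes a 'demo' keyspace exists
    insert = session.prepare('INSERT INTO events (id, body) VALUES (?, ?)')

    # naive sync writes: one round-trip at a time
    for i in range(1000):
        session.execute(insert, (i, 'event %d' % i))

    # batched futures: issue many writes, then wait on all of them
    futures = [session.execute_async(insert, (i, 'event %d' % i)) for i in range(1000)]
    for future in futures:
        future.result()                              # raises if the write failed

    cluster.shutdown()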
  50. bonus round: asyncio crawler • if there’s time, show GvR’s

    example crawler • asyncio, @coroutine, and yield from in a readable Python 3.4 program • crawls 1,300 pages in 7 seconds