What is concurrency? Is Python good at it? How do we scale up from single-node to multi-node concurrency? What does Python's new concurrent.futures stdlib module give us? What does asyncio give us? How do we evaluate concurrent performance?
What’s Concurrency?
● Our system’s ability to do more than one thing at once, i.e. simultaneous:
○ web requests
○ database transactions
○ requests to drives
○ requests to databases
○ requests to web services
○ user input
● run your app with less
● simplify
● save green: $
From 100 web requests per second per server...
...to 10,000 web requests per second per server.
Getting it right is a Big Deal(R)
What stops us from doing work?
● contention -- fighting over a resource (like a lock)
● blocking -- stopping execution to wait (stops me, lets everyone else go)
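For a concrete feel, here is a minimal sketch of both at once (the worker and lock names are our own illustration, not from the slides): each thread blocks inside time.sleep() while holding the lock, so every other thread contends for it.

import threading
import time

lock = threading.Lock()

def worker(name):
    with lock:             # contention: every thread fights over this one lock
        time.sleep(0.1)    # blocking: the holder stops to wait; others queue up
        print(name, 'done')

threads = [threading.Thread(target=worker, args=('t%d' % i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()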
My OS just does this, right?
● processes: sandboxed memory space
● threads: run within a process, share memory with other threads
● These get me where I need to be?
Is Python good at this?
● It’s half good
GIL!!!! Who is GIL? The Global Interpreter Lock:
● Only one thread runs in the Python interpreter at once
● Threads tend to keep the GIL until done or until they block on IO
But…
● This slows us down. Contention!
● To teh codez!
Python CPU Bound Threads

from queue import Queue
from threading import Thread

inQ = Queue()
outQ = Queue()

def worker():
    while True:
        l = inQ.get()       # get work to do
        sumL = sum(l)       # CPU-bound work
        outQ.put(sumL)      # work output

numWorkers = 10
ts = [Thread(target=worker) for i in range(numWorkers)]   # create threads
for t in ts:
    t.start()
# main thread carries on
Python IO Worker Threads

import requests
from queue import Queue
from threading import Thread

inQ = Queue()
outQ = Queue()

def worker():
    while True:
        url = inQ.get()                                   # get work to do
        resp = requests.get(url)                          # blocking IO
        outQ.put((url, resp.status_code, resp.text))      # work output

numWorkers = 10
ts = [Thread(target=worker) for i in range(numWorkers)]
for t in ts:
    t.start()
CPU Bound Threads… we like?
(same worker/Queue code as above)
☹ CPU-bound work doesn’t release the GIL, so...
☹ … no gain from more threads/cores
☹ … only more contention
☹ Contention points?
☺ Code is straightforward
IO Worker Threads… we like?
(same worker/Queue code as above)
☺ Blocking IO gives up the GIL
☺ Code is straightforward
☹ One blocking IO operation per thread, so...
☹ … how many threads to start? a pool?
☹ Contention points?
Improve upon CPU bound?
Using Processes (and an interprocess Queue) -- same worker code:

from multiprocessing import Process, Queue

def worker(inQ, outQ):
    while True:
        l = inQ.get()
        sumL = sum(l)
        outQ.put(sumL)

inQ = Queue()
outQ = Queue()
p = Process(target=worker, args=(inQ, outQ))
p.start()
CPU Bound Processes… we like?

numWorkers = 10
ts = [Process(target=worker, args=(inQ, outQ)) for i in range(numWorkers)]
...

☺ Not sharing the GIL: more processes means concurrent work
☺ Max out all your cores!
☺ Code is straightforward
☹ Contention? ☺ (less scary -- process abstractions)
We LIKE!
Processes Rule, Threads Drool

                 Threads                               Processes
Light?           Yes ☺                                 Almost as light ☺
Danger?          High -- mutable, shared state;        Lower -- stricter communication ☺
                 deadlocks ☹
Communication    Mutexes/locks, atomic CPU             OS abstractions: pipes, sockets,
primitives?      instructions, thread-safe             shared memory, etc. ☺
                 data structures ☺
If they crash…   Whole program crashes ☹               Only that process crashes ☺
Control?         Extremely high ☺                      Moderate, through abstractions
GIL?             Yes ☹                                 No ☺
Improve on IO bound work?
IO-bound work often looks like:

def worker():
    while True:
        handle_ui_input()
        handle_io()

Or, split across threads that share state:

def worker1():
    while True:
        resp = handle_io1()           # blocking
        update_shared_state(resp)     # contention -- need to lock and update

def worker2():
    while True:
        resp = handle_io2()           # blocking
        update_shared_state(resp)     # contention -- need to lock and update
Other OS primitives help?
Simultaneously block on multiple IO operations; no longer need to lock shared state (everything runs in one thread):

def event_loop():
    while True:
        whichIsReady = select(ui, io1, io2)   # block until any source is ready
        if whichIsReady == io1:
            resp = handle_io1(req)
        if whichIsReady == ui:
            ...
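The slide above is pseudocode; for the real flavor, here is a minimal sketch using the stdlib’s select.select() over actual sockets (the echo server setup is our own illustration):

import select
import socket

server = socket.socket()
server.bind(('127.0.0.1', 9000))
server.listen(5)
sockets = [server]

while True:
    # block until at least one socket is ready for reading
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is server:
            conn, addr = server.accept()
            sockets.append(conn)       # start watching the new connection
        else:
            data = s.recv(4096)
            if data:
                s.send(data)           # echo back
            else:
                sockets.remove(s)      # peer closed
                s.close()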
Non-blocking IO improvements
Promise/future style (pseudocode) -- chain callbacks instead of blocking:

httpReq = HttpReq('http://odu.edu')
promise = httpReq.fetch()
promise.whenComplete(callWhenDone)
imgPromise = parseImgTags(promise)
dbPromise = storeInDb(imgPromise)

Coroutines/cooperative multitasking -- I own this thread until I say I’m done:

myCoroutine.yieldUntil(dbPromise)   # looks like a blocking call, but in reality
                                    # yields back to the event loop
print("Fetched and stored IMG tags!")

☺ Readability of blocking IO
☺ Performance of non-blocking async IO
Example: Greenlet/Gevent
Greenlet:
● coroutine library; the greenlet decides when to give up control
Gevent:
● monkey-patches a big event loop into Python, replacing the core blocking-IO bits
Gevent is magic

import gevent
from gevent import monkey
monkey.patch_all()   # replace blocking stdlib IO with gevent-aware versions
import requests

class Fetcher(object):
    def __init__(self, fetchUrl):
        self.fetchUrl = fetchUrl
        self.resp = None

    def fetch(self):
        # blocking? depends on your definition of "blocking" --
        # under gevent this yields to the event loop while waiting
        self.resp = requests.get(self.fetchUrl)

def fetchMultiple(urls):
    fetchers = [Fetcher(url) for url in urls]
    handles = []
    for fetcher in fetchers:
        # spawn a gevent worker that calls "fetch"
        handles.append(gevent.spawn(fetcher.fetch))
    gevent.joinall(handles)   # wait till all done
    return fetchers
Other Solutions (aka future lightning talk fodder)
● Twisted
● CPython C modules (scipy, your own module)
● Cython (compiles Python -> C)
● Jython/IronPython (JIT on the JVM or .NET CLR)
● GPUs (CUDA, etc.)
● cluster frameworks (discussed later)
Options for multi-node concurrency

Start here:
● Redis as message broker
● pure Python tasks
● pure Python worker infrastructure
● simple message patterns

Upgrade here:
● choose your own message broker
● pure Python tasks
● pure Python worker infrastructure
● advanced message patterns

Then:
● choose your own message broker
● mix Python + Java or other languages
● Java worker infrastructure
● advanced message patterns
● more complex operationally
● high availability & linear scalability
● “Lambda Architecture”
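The slide doesn’t name a framework, but Celery is one common fit for the “start here” profile (Redis broker, pure Python tasks and workers). A minimal sketch, assuming Celery and a Redis server on localhost:

from celery import Celery

# broker/backend URLs assume a local Redis; adjust for your deployment
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def add(x, y):
    return x + y

# from any other process: queue the task and wait for its result
# result = add.delay(2, 3)
# print(result.get(timeout=5))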
Python concurrency in clusters
● mrjob: Hadoop Streaming (batch) -- see the sketch after this list
● streamparse: Apache Storm (real-time)
● parallelize through the Python process model
● mixed workloads
○ CPU- and IO-bound
● mixed concurrency models are possible
○ threads within Storm Bolts
○ process pools within Hadoop Tasks
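A minimal mrjob sketch (the classic word count; the job name is our own). Hadoop Streaming runs one copy of this class per task process, which is how the Python process model parallelizes the work across the cluster:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # each Hadoop task process runs this over its own input split
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Run locally with `python word_count.py input.txt`, or on a cluster with `-r hadoop` (file names here are hypothetical).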
What is asyncore?
● stdlib-included async sockets (like libev)
● in stdlib since 2000!
Comment from the source code in 2000:
“There are only two ways to have a program on a single processor do ‘more than one thing at a time.’ Multi-threaded programming is the simplest and most popular way to do it, but there is another very different technique, that lets you have nearly all the advantages of multi-threading, without actually using multiple threads. It’s really only practical if your program is largely I/O bound. If your program is CPU bound, then pre-emptive scheduled threads are probably what you really need. Network servers are rarely CPU-bound, however.”
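What that looks like in practice -- a minimal asyncore echo server, mirroring the dispatcher pattern from the stdlib docs (host/port are arbitrary):

import asyncore
import socket

class EchoHandler(asyncore.dispatcher_with_send):
    def handle_read(self):
        data = self.recv(4096)
        if data:
            self.send(data)

class EchoServer(asyncore.dispatcher):
    def __init__(self, host, port):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind((host, port))
        self.listen(5)

    def handle_accept(self):
        pair = self.accept()
        if pair is not None:
            sock, addr = pair
            EchoHandler(sock)   # one handler per connection, all in one thread

server = EchoServer('127.0.0.1', 8080)
asyncore.loop()   # the event loop: select()-based multiplexing under the hood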
What is concurrent.futures? ● PEP-3148 ● new unified API for concurrency in Python ● in stdlib in Python 3.2+ ● backport in 2.7 (pip install futures) ● API design like Java Executor Framework ● a Future abstraction as a return value ● an Executor abstraction for running things
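A minimal sketch of the Executor/Future API (the URL list is hypothetical):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = ['http://odu.edu', 'http://xkcd.com']   # hypothetical work list

# the Executor abstraction runs things; submit() returns a Future
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(requests.get, url): url for url in urls}
    for future in as_completed(futures):       # yields futures as they finish
        url = futures[future]
        try:
            resp = future.result()             # re-raises worker exceptions
            print(url, resp.status_code)
        except Exception as exc:
            print(url, 'failed:', exc)

For CPU-bound work, swap in ProcessPoolExecutor; the rest of the code is unchanged -- that is the point of the unified API.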
asyncio history ● PEP-3153: Async IO Support ● PEP-3156: Async IO Support “Rebooted” ● GvR’s pet project from 2012-2014 ● Original implementation called tulip ● Released in Python 3.4 as asyncio ● PyCon 2013 keynote by GvR focused on it ● PEP-380 (yield from) utilized by it
asyncio primitives
● a loop that starts and stops
● callback scheduling (see the sketch below)
○ now
○ at a time in the future
○ repeated / periodic
● associate callbacks with file I/O states
● offers a pluggable I/O multiplexing mechanism
○ select()
○ poll(), epoll(), others
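A minimal sketch of the scheduling primitives (the tick callback is our own illustration):

import asyncio

def tick(loop, n):
    print('tick', n)
    if n < 3:
        loop.call_later(1.0, tick, loop, n + 1)   # a time in the future
    else:
        loop.stop()

loop = asyncio.get_event_loop()
loop.call_soon(tick, loop, 0)   # "now": runs on the next loop iteration
loop.run_forever()
loop.close()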
asyncio “ooooh ahhhh” moments
● introduces the @coroutine decorator
● uses yield from to simplify callback hell
● one event loop to rule them all
○ Twisted and Tornado and gevent in the same app!
● offers an asyncio.Future
○ asyncio.Future quacks like futures.Future
○ asyncio.wrap_future is an adapter (sketch below)
○ asyncio.Task is a subclass of Future
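A minimal sketch of the wrap_future adapter (blocking_work is our own illustration): a concurrent.futures.Future from a thread pool becomes awaitable inside the event loop.

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

def blocking_work():
    return sum(range(10 ** 6))

@asyncio.coroutine
def main():
    cf_future = executor.submit(blocking_work)    # a futures.Future
    aio_future = asyncio.wrap_future(cf_future)   # now quacks like asyncio.Future
    result = yield from aio_future
    print('result:', result)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
executor.shutdown()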
asyncio coroutines

code                            explanation
result = yield from future      suspend until the future is done, then return its result
result = yield from coroutine   suspend until the coroutine returns a result
return expression               return a result to another coroutine
raise exception                 raise an exception in another coroutine
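Putting those together -- a minimal Python 3.4-style sketch (compute and main are our own names):

import asyncio

@asyncio.coroutine
def compute(x, y):
    yield from asyncio.sleep(1.0)   # suspend; the loop runs other work meanwhile
    return x + y                    # return a result to the calling coroutine

@asyncio.coroutine
def main():
    result = yield from compute(1, 2)   # suspend until compute() returns
    print('result:', result)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()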
What is the gain of “yield from”? (1)
“So, why don’t I like green threads? In a simple program using stackless or gevent, it’s easy enough to say, ‘This is a call that goes to the scheduler -- it uses read() or send() or something. I know that’s a blocking call, I’ll be careful… I don’t need explicit locking because between points A and B, I just need to make sure I don’t make any other calls to the scheduler.’ However, as code gets longer, it becomes hard to keep track. Sooner or later…”
- Guido van Rossum
What is the gain of “yield from”? (2) Just trust me; this problem happened to me at a young and impressionable age, and I’ve never forgotten it. - Guido van Rossum
The Future() is now()
● Tulip project update (Jan 2014) by @guidovanrossum
● Unyielding (Feb 2014) by @glyph
● Generators: The Final Frontier (Apr 2014) by @dabeaz
Concurrency in the real world
● the Cassandra driver ships several concurrency models
● pip install cassandra-driver
● asyncore and libev event loops preferred
● twisted and gevent also provided
● performance differences are dramatic (ranging from 2k to 18k write ops per sec)
yield from CassandraDemo()
● spin up a Cassandra node
● execute / benchmark naive sync writes
● switch to batched futures (see the sketch after this list)
● switch to callback chaining
● try different event loops
● switch to pypy
● discuss how asyncio could clean this up
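A minimal sketch of the sync-vs-futures step, assuming cassandra-driver and a local node (the keyspace and table names here are hypothetical):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo')   # hypothetical keyspace
insert = session.prepare('INSERT INTO events (id, body) VALUES (?, ?)')

# naive sync: each execute() blocks until the write round-trips
for i in range(1000):
    session.execute(insert, (i, 'payload'))

# batched futures: start many writes in flight, then wait on them together
futures = [session.execute_async(insert, (i, 'payload')) for i in range(1000)]
for f in futures:
    f.result()   # blocks; raises if that write failed

cluster.shutdown()

The futures version keeps many writes in flight per connection instead of one, which is where the large throughput gap in the benchmark comes from.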
bonus round: asyncio crawler ● if there’s time, show GvR’s example crawler ● asyncio, @coroutine, and yield from in a readable Python 3.4 program ● crawls 1,300 xkcd.com pages in 7 seconds