• Apache/Subversion Committer
• Founded Snakebite @ Michigan State University
  o AIX RS/6000
  o SGI IRIX/MIPS
  o Alpha/Tru64
  o Solaris/SPARC
  o HP-UX/IA64
  o FreeBSD, NetBSD, OpenBSD, DragonFlyBSD
• Background is UNIX
• Made peace with Windows when XP came out
• Author of PyParallel
• PyParallel: allows multiple interpreter threads to run in parallel without incurring any additional performance penalties
• Intrinsically paired with Windows asynchronous I/O primitives (IOCP), Vista+ thread pools and kernel synchronization primitives
• Working prototype/proof-of-concept after 3 months:
  o Performant HTTP server, written in Python, that automatically exploits all cores:
      pyparallel.exe -m async.http.server
• What problem was I trying to solve?
• Wasn't happy with the status quo
  o Parallel options (for compute-bound, data parallelism problems):
    • GIL prevents simultaneous multithreading
    • …so you have to rely on separate Python processes if you want to exploit more than one core
  o Concurrency options (for I/O-bound or I/O-driven, task parallelism problems):
    • One thread per client, blocking I/O
    • Single thread, event loop, multiplexing system call (select/poll/epoll/kqueue); see the sketch below
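For readers unfamiliar with the second style, here is a minimal sketch of the "single thread + multiplexing system call" pattern using the stdlib selectors module; the echo behaviour and port number are illustrative, not taken from the talk:

    import selectors
    import socket

    sel = selectors.DefaultSelector()   # epoll/kqueue/select, whichever the OS provides

    def accept(server_sock):
        conn, _ = server_sock.accept()
        conn.setblocking(False)                        # every socket is non-blocking
        sel.register(conn, selectors.EVENT_READ, handle)

    def handle(conn):
        data = conn.recv(4096)
        if data:
            conn.send(data)                            # echo it back, still one thread
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("0.0.0.0", 8080))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                                        # the event loop: one thread, one core
        for key, _ in sel.select():
            key.data(key.fileobj)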
• The types of problems I want to solve:
  o Computationally-intensive (compute-bound) work against TBs/PBs of data (I/O-bound)
  o Serving tens of thousands of network clients (I/O-driven) with non-trivial computation required per request (compute-bound)
  o Serving fewer clients, but providing ultra-low latency or maximum throughput to those you do serve (HFT, remote array servers, etc.)
• Contemporary data center hardware:
  o 128 cores, 512GB RAM
  o Quad 10Gb Ethernet NICs
  o SSDs & Fusion-IO style storage -> 500k-800k+ IOPS from a single device
  o 2016: 128Gb Fibre Channel (4x32Gb) -> 25.6GB/s throughput
• I want to solve these types of problems as optimally as my hardware will allow
• Optimal hardware use necessitates things like:
  o One active thread per core
    • Any more results in unnecessary context switches
  o No unnecessary duplication of shared/common data in memory
  o Ability to saturate the bandwidth of my I/O devices
• And I want to do it all in Python
• …yet still be competitive against C/C++ where it matters
• So, for this talk:
  o What are my options today?
  o What might my options look like tomorrow?
• Concurrency:
  o Making progress on multiple things at the same time
    • Task A doesn't need to complete before you can start work on task B
  o Typically used to describe I/O-bound or I/O-driven systems, especially network-oriented socket servers
• Parallelism:
  o Making progress on one thing in multiple places at the same time
    • Task A is split into 8 parts, each part runs on a separate core
  o Typically used in compute-bound contexts
    • Map/reduce, aggregation, "embarrassingly parallelizable" data, etc.
• Concurrency: how many things did I do?
  o Things = units of work (e.g. servicing network clients)
  o Performance benchmark:
    • How fast was everyone served? (i.e. request latency)
    • And were they served fairly?
• Parallelism: how many things did I do them on?
  o Things = hardware units (e.g. CPU cores, GPU cores)
  o Performance benchmark:
    • How much did I get done?
    • How long did it take?
• There are some pretty decent Python libraries out there geared toward concurrency
  o Twisted, Tornado, Tulip/asyncio (3.x), etc.
• Common themes:
  o Set all your sockets and file descriptors to non-blocking
  o Write your Python in an event-oriented fashion (see the sketch below)
    • def data_received(self, data): …
    • Hollywood Principle: don't call us, we'll call you
  o Appearance of asynchronous I/O achieved via a single-threaded event loop with a multiplexing system call
• Biggest drawback:
  o Inherently limited to a single core
  o Thus, inadequate for problems that are both concurrent and computationally bound
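As a concrete illustration of that event-oriented style, here is a hedged sketch using the stdlib asyncio Protocol API (the echo behaviour and port are hypothetical, not from the talk):

    import asyncio

    class Echo(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport      # the framework calls us when a client connects

        def data_received(self, data):      # ...and again whenever data arrives
            self.transport.write(data)      # echo it straight back

    async def main():
        loop = asyncio.get_running_loop()
        server = await loop.create_server(Echo, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()    # single-threaded event loop underneath

    asyncio.run(main())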
• Coarse-grained (task parallelism):
  o Processing daily files
  o Data mining distinct segments/chunks/partitions
  o Process A runs on X data set, independent of process B running on Y data set
• Fine-grained (data parallelism):
  o Map/reduce, divide & conquer, aggregation, etc.
  o Common theme: sequential execution, fan out to parallel work against a shared data set, collapse back down to sequential
• Coarse-grained (multiple processes):
  o Typically adequate: using multiple processes that don't need to talk to each other (or, if they do, don't need to talk often)
  o Depending on shared state, could still benefit if implemented via threads instead of processes
    • Better cache usage, less duplication of identical memory structures, less overhead overall
• Fine-grained (multiple threads):
  o Typically optimal: using multiple threads within the same address space
  o IPC overhead can severely impact net performance when having to use processes instead of threads
• The GIL (global interpreter lock) prevents more than one Python interpreter thread from running at a given time
• If you want to use multiple threads within the same Python process, you have to come up with a way to avoid the GIL
  o (Fine-grained parallelism =~ multithreading)
• Today, this relies on:
  o Extension modules or libraries
  o Bypassing the CPython interpreter entirely and compiling to machine code
• Options today:
  o Extension modules or libraries:
    • Accelerate/NumbaPro (GPU, multicore)
    • OpenCV
    • Intel MKL libraries
  o Bypassing the CPython interpreter entirely by compiling Python to machine code (see the sketch below):
    • Numba with threading
    • Cython with OpenMP
• Options tomorrow (Python 4.x):
  o PyParallel?
    • Demonstrates it is possible to have multiple CPython interpreter threads running in parallel without incurring a performance overhead
  o PyPy-STM?
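As a rough illustration of the "compile to machine code, sidestep the GIL" route, here is a hedged sketch using Numba's parallel JIT (assumes numba and numpy are installed; the column-sum kernel is illustrative, not from the talk):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)                  # compiled to machine code, no interpreter in the loop
    def column_sums(a):
        out = np.zeros(a.shape[1])
        for j in prange(a.shape[1]):      # iterations are spread across worker threads
            s = 0.0
            for i in range(a.shape[0]):
                s += a[i, j]
            out[j] = s
        return out

    data = np.random.rand(10_000, 256)
    print(column_sums(data)[:4])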
• Plenty of third-party options for throwing multiple Python processes at your problem:
  o https://wiki.python.org/moin/ParallelProcessing
  o batchlib, Celery, Deap, disco, dispy, DistributedPython, exec_proxy, execnet, IPython Parallel, jug, mpi4py, PaPy, pyMPI, pypar, pypvm, Pyro, rthread, SCOOP, seppo, superspy
• Python stdlib options:
  o multiprocessing (since 2.6)
  o concurrent.futures (introduced in 3.2, backported to 2.7); see the sketch below
• Common throughout:
  o Separate Python processes to achieve parallel execution
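For reference, a hedged sketch of the stdlib route via concurrent.futures, which farms work out to separate processes (the work function and inputs are illustrative, not from the talk):

    from concurrent.futures import ProcessPoolExecutor
    import os

    def crunch(n):
        return sum(i * i for i in range(n))      # stand-in for compute-bound work

    if __name__ == "__main__":                   # required on Windows (spawn start method)
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            results = list(pool.map(crunch, [10**6] * 8))   # one task per worker process
        print(results[:2])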
• No talk on parallelism and concurrency in Python would be complete without mentioning the GIL (global interpreter lock)
• What is it?
  o A lock that ensures only one thread can execute CPython innards at any given time
  o Create 100 threading.Thread() instances…
  o …and only one will run at any given time
• So why even support threads if they can't run in parallel?
• Because they can be useful for blocking, I/O-bound problems (see the sketch below)
  o Ironically, they facilitate concurrency in Python, not parallelism
• But they won't solve your compute-bound problem any faster
• Nor will you ever exploit more than one core
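A hedged sketch of the one case where threads do pay off under the GIL: overlapping blocking I/O waits (the URLs are placeholders, not from the talk):

    import threading
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:    # blocking I/O releases the GIL
            print(url, len(resp.read()))

    urls = ["https://www.python.org", "https://www.example.com"]
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()        # the waits overlap (concurrency), but compute never uses >1 core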
• multiprocessing doesn't solve my problem
• Inadequate for fine-grained parallelism
• Inadequate for I/O-driven problems (specifically socket servers)
• Overhead of extra processes
• No shared memory out of the box (I'd have to set it up myself)
• Kinda quirky on Windows
• The examples in the docs are trivialized and don't really map to real-world problems
  o https://docs.python.org/2/library/multiprocessing.html
  o i.e. x*x for x in [1, 2, 3, 4] (roughly the shape shown below)
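The kind of documentation example being criticized looks roughly like this (paraphrased from the multiprocessing docs, not copied from the talk):

    from multiprocessing import Pool

    def f(x):
        return x * x                        # trivially parallel: no I/O, no shared state

    if __name__ == "__main__":
        with Pool(4) as p:
            print(p.map(f, [1, 2, 3, 4]))   # [1, 4, 9, 16]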
• Say I've got that contemporary box: 128 cores, 512GB RAM
• I want to use multiprocessing to solve my compute-bound problem
• And I want to optimally use my hardware; idle cores are useless
• So how big should my multiprocessing pool be? How many processes?
• 128, right?
• Except compute-bound problems invariably also need to do I/O
• And you're probably going to be doing blocking I/O
  o i.e. synchronous read/write calls
  o Non-blocking I/O is poorly suited to multiprocessing, as you'd need per-process event loops doing the syscall multiplexing dance
• The problem is, as soon as you block, that's one less process able to do useful work
• Can quickly become pathological:
  o Start a pool of 64 processes (for 64 cores)
  o A few minutes later: only 20-25 active
• Is the solution to create a bigger pool?
• No: increasing the number of processes isn't the solution
• It results in pathological behavior at the opposite end of the spectrum
  o Instead of idle cores, you have over-scheduled cores
  o Significant overhead incurred by context switching
    • Cache pollution, TLB contention
  o You can visibly see this with basic tools like top: 20% user, 80% sys
• Neither approach is optimal today:
  o processes <= ncpu: idle cores
  o processes > ncpu: over-scheduled cores
• We want to solve our problems optimally on our powerful hardware
• Avoid the sub-optimal:
  o Idleness
  o Blocking I/O
  o Context switching
  o Wasteful memory use
• Encourage the optimal:
  o One active thread per core
  o Efficient memory use
• Doing this well is a complex problem
• It is intrinsically dependent upon the I/O facilities provided by the OS:
  o Readiness-oriented or completion-oriented?
  o Thread-agnostic I/O or thread-specific I/O?
• Plus one critical element:
  o Disassociating the work (computation) from the worker (thread)
  o Associating a desired concurrency level (i.e. use all my cores) with the work
• This allows the kernel to make intelligent thread dispatching decisions
  o Ensures only one active thread per core
  o No over-scheduling or unnecessary context switches
• Turns out Windows is really good at all this
• I/O Completion Ports: the I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:
      do { s = GQCS(i); process(s); } while (1);    (one such loop per worker thread)
      GQCS = GetQueuedCompletionStatus()
• [Diagram: NIC -> IRP -> I/O Manager -> completion packet -> IOCP -> worker threads]
• Create an IOCP with maximum concurrency set to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (i.e. file I/O)
• Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• [Diagram: four threads, each looping do { s = GQCS(i); process(s); } while (1); against an IOCP created with concurrency=2]
• This mindset and approach to asynchronous I/O is very well suited to what I wanted to do with PyParallel
• Vista introduced new thread pool APIs
• Tightly integrated into the IOCP/overlapped ecosystem
• Greatly reduced the amount of scaffolding code I needed to write to prototype the concept:
      void PxSocketClient_Callback();
      CreateThreadpoolIo(.., &PxSocketClient_Callback)
      ..
      StartThreadpoolIo(..)
      AcceptEx(..)/WSASend(..)/WSARecv(..)
• That's it. When the async I/O op completes, your callback gets invoked
• Windows manages everything: optimal thread pool size, NUMA-cognizant dispatching
• Didn't need to create a single thread, no mutexes, none of the normal headaches that come with multithreading
• So now I can exploit my hardware
• But it's still Python…
• …and Python can be kinda slow
• Especially when doing computationally-intensive work
  o Especially especially when doing numerically-oriented computation
• Enter… Numba!
• Numba with PyParallel support
• @jit hooks introduced in the stdlib
• …and an API for multiple downstream JITs to hook into
  o CPython broadcasts the AST/bytecode being executed in ceval to the JITs
  o Multiple JITs running in separate threads
  o CPython: "Hey, can you optimize this chunk of Python? Let me know."
  o Next time it encounters that chunk, it can check for optimized versions
• Could provide a viable way of hooking in Numba, PyPy, Pythran, ShedSkin, etc., whilst still staying within the confines of CPython
• Windows was designed around asynchronous I/O
• Threads have been first-class citizens since day 1 (not bolted on as an afterthought)
• Designed to be programmed in a completion-oriented, multi-threaded fashion
• Overlapped I/O + IOCP + threads + kernel synchronization primitives = excellent combo for achieving high performance
• The best way to understand IOCP is to understand the problem it was designed to solve:
  o Facilitate writing high-performance network/file servers (HTTP, database, file server)
  o Extract maximum performance from multi-processor/multi-core hardware
  o (Which necessitates optimal resource usage)
• Maximum performance:
  o A thread running on every core servicing a client request
  o Upon finishing a client request, immediately process the next request if one is waiting
  o Never block
  o (And if you do block, handle it as optimally as possible)
• Optimal resource usage:
  o One active thread per core
• UNIX approach:
  o Set the file descriptor to non-blocking
  o Try to read or write data
  o Get EAGAIN instead of blocking
  o Try again later (see the sketch below)
• Windows approach:
  o Create an overlapped I/O structure
  o Issue a read or write, passing the overlapped structure and completion port info
  o The call returns immediately
  o The read/write is done asynchronously by the I/O manager
  o An optional completion packet is queued to the completion port a) on error, b) on completion
  o A thread waiting on the completion port de-queues the completion packet and processes the request
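For the UNIX side, a hedged Python sketch of what "get EAGAIN instead of blocking" looks like in practice (the hostname and buffer size are placeholders; EAGAIN surfaces in Python as BlockingIOError):

    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.setblocking(False)                 # switch the socket to non-blocking
    try:
        data = sock.recv(4096)              # nothing to read yet?
    except BlockingIOError:                 # EAGAIN/EWOULDBLOCK: try again later,
        data = b""                          # typically after select()/epoll says it's ready
    print(len(data))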
• UNIX approach:
  o Anything to write yet?
  o No? How about now?
  o Still no?
  o Now?
  o Yes!? Really? Ok, write it!
  o Hi! Me again. Anything to read?
  o No?
  o How about now?
• Windows approach:
  o Here, do this. Let me know when it's done.
• Readiness-oriented (reactor pattern) vs completion-oriented (proactor pattern)
• Windows has asynchronous ways to do just about everything
• Basically, if it could block, there's a way to do it asynchronously in Windows
• WSASend()/WSARecv() vs send()/recv()
• AcceptEx() vs accept()
• ConnectEx() vs connect()
• DisconnectEx() vs close()
• GetAddrinfoEx() vs getaddrinfo() (Windows 8+)
• (And that's just for sockets; all device I/O can be done asynchronously)
• Thread-agnostic I/O on Windows
• IOCPs allow IRP completion (copying data from non-paged kernel memory back to the user's buffer) to be deferred to a thread-agnostic queue
• Any thread can wait on this queue (the completion port) via GetQueuedCompletionStatus()
• IRP completion is done just before that call returns
• This allows the I/O manager to rapidly queue IRP completions
• …and waiting threads to instantly dequeue and process them
• IOCP and concurrency: the I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:
      do { s = GQCS(i); process(s); } while (1);    (one such loop per worker thread)
      GQCS = GetQueuedCompletionStatus()
• [Diagram: NIC -> IRP -> I/O Manager -> completion packet -> IOCP -> worker threads]
• Goals: maximize performance, optimize resource usage
• Optimal number of active threads running per core: 1
• Optimal number of total threads running: 1 * ncpu
• Windows can't control how many threads you create and then have waiting against the completion port
• But it can control when, and how many, threads get woken up
• …via the IOCP's maximum concurrency value
• (Specified when you create the IOCP; see the sketch below)
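As a rough, hedged sketch of what the maximum concurrency value means at the API level, here it is driven from Python via the third-party pywin32 bindings; this is an assumption of mine for illustration, not code from the talk (the underlying C parameter is CreateIoCompletionPort's NumberOfConcurrentThreads):

    import threading
    import time
    import win32event
    import win32file

    NCPU = 2
    # Last argument = the IOCP's maximum concurrency value: how many waiting
    # threads Windows will allow to be active at once.
    port = win32file.CreateIoCompletionPort(win32file.INVALID_HANDLE_VALUE, None, 0, NCPU)

    def worker():
        while True:
            rc, nbytes, key, overlapped = win32file.GetQueuedCompletionStatus(port, win32event.INFINITE)
            print("completion:", key, nbytes)   # the process(s) step from the slide's loop

    for _ in range(NCPU * 2):                   # create more threads than cores...
        threading.Thread(target=worker, daemon=True).start()

    # ...Windows still activates at most NCPU of them at a time.
    win32file.PostQueuedCompletionStatus(port, 0, 42, None)
    time.sleep(1)                               # give a worker a chance to dequeue it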
• Create an IOCP with maximum concurrency set to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (i.e. file I/O)
• Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• [Diagram: four threads, each looping do { s = GQCS(i); process(s); } while (1); against an IOCP created with concurrency=2]
• How PyParallel works:
  o No GIL removal
    • This was previously tried and rejected
    • Required fine-grained locking throughout the interpreter
    • Mutexes are expensive
    • Single-threaded execution became significantly slower
  o Not using PyPy's approach via Software Transactional Memory (STM)
    • Huge overhead
    • 64 threads trying to write to something: 1 wins and continues
    • 63 keep trying
    • 63 bottles of beer on the wall…
• Doesn't support "free threading"
  o Existing code using threading.Thread won't magically run on all cores
  o You need to use the new async APIs
• Keep the GIL: it serves a very useful purpose
• Instead, intercept all thread-sensitive calls:
  o Reference counting (Py_(INCREF|DECREF|CLEAR))
  o Memory management (PyMem_(Malloc|Free), PyObject_(INIT|NEW))
  o Free lists
  o Static C globals
  o Interned strings
• If we're the main thread, do what we normally do
• However, if we're a parallel thread, do a thread-safe alternative
• "If we're a parallel thread, do X; if not, do Y"
  o X = thread-safe alternative
  o Y = what we normally do
• "If we're a parallel thread"
  o Thread-sensitive calls are ubiquitous
  o But we want to have a negligible performance impact
  o So the challenge is: how quickly can we detect that we're a parallel thread?
  o The quicker we can detect it, the less overhead incurred
      #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())
• What's so special about _Py_get_current_thread_id()?
  o On Windows, you could use GetCurrentThreadId()
  o On POSIX, pthread_self()
• Unnecessary overhead (this macro will be everywhere)
• Is there a quicker way?
• Can we determine if we're running in a parallel context without needing a function call?
      +    { \
      +        if (Py_PXCTX) \
      +            _Px_ForgetReference(op); \
      +        else \
      +            _Py_INC_TPFREES(op); \
      +    } while (0)
      +
      +#endif /* WITH_PARALLEL */
• Py_PXCTX == (Py_MainThreadId == __readfsdword(0x48))
• Overhead reduced to a couple more instructions and an extra branch (the cost of which can be eliminated by branch prediction)
• That's basically free compared to STM or fine-grained locking
• Negligible overhead of Py_PXCTX for normal single-threaded code
  o GIL removal: 40% overhead
  o PyPy's STM: "200-500% slower"
• Only touches a relatively small amount of code
  o No need for intrusive surgery like rewriting a thread-safe bucket memory allocator or garbage collector
• Keeps GIL semantics
  o Important for legacy code
  o 3rd-party libraries, C extension code
• Code executing in a parallel context has full visibility of "main thread objects" (in a read-only capacity, thus no need for locks)
• In the demo coming up:
  o One python_d.exe process
  o Constant memory use
  o CPU use proportional to concurrent client count (1 client = 25% CPU use)
  o Every 10,000 sends, a status message is printed
• Depicts dynamically switching from synchronous sends to async sends
• Illustrates awareness of active I/O hogs
• Environment:
  o MacBook Pro, 8-core i7 2.2GHz, 8GB RAM
  o 1-5 netcat instances on OS X
  o Windows 7 instance running in Parallels, 4 cores, 3GB