• Apache/Subversion Committer
• Founded Snakebite @ Michigan State University
  o AIX RS/6000
  o SGI IRIX/MIPS
  o Alpha/Tru64
  o Solaris/SPARC
  o HP-UX/IA64
  o FreeBSD, NetBSD, OpenBSD, DragonFlyBSD
• Background is UNIX
• Made peace with Windows when XP came out
• Author of PyParallel
• PyParallel: allows multiple interpreter threads to run in parallel without incurring any additional performance penalties
• Intrinsically paired with Windows asynchronous I/O primitives (IOCP), Vista+ thread pools and kernel synchronization primitives
• Working prototype/proof-of-concept after 3 months:
  o Performant HTTP server, written in Python, that automatically exploits all cores:
      pyparallel.exe -m async.http.server
• What problem was I trying to solve?
• Wasn't happy with the status quo
  o Parallel options (for compute-bound, data parallelism problems):
    • GIL prevents simultaneous multithreading
    • …so you have to rely on separate Python processes if you want to exploit more than one core
  o Concurrency options (for I/O-bound or I/O-driven, task parallelism problems):
    • One thread per client, blocking I/O
    • Single thread, event loop, multiplexing system call (select/poll/epoll/kqueue); see the sketch below
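For readers unfamiliar with the second style, here is a minimal sketch of the "single thread + multiplexing system call" pattern using the stdlib selectors module; the echo behaviour and port number are illustrative, not taken from the talk:

    import selectors
    import socket

    sel = selectors.DefaultSelector()   # epoll/kqueue/select, whichever the OS provides

    def accept(server_sock):
        conn, _ = server_sock.accept()
        conn.setblocking(False)                        # every socket is non-blocking
        sel.register(conn, selectors.EVENT_READ, handle)

    def handle(conn):
        data = conn.recv(4096)
        if data:
            conn.send(data)                            # echo it back, still one thread
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("0.0.0.0", 8080))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                                        # the event loop: one thread, one core
        for key, _ in sel.select():
            key.data(key.fileobj)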
• The types of problems I want to solve:
  o Computationally-intensive (compute-bound) work against TBs/PBs of data (I/O-bound)
  o Serving tens of thousands of network clients (I/O-driven) with non-trivial computation required per request (compute-bound)
  o Serving fewer clients, but providing ultra-low latency or maximum throughput to those you do serve (HFT, remote array servers, etc.)
• Contemporary data center hardware:
  o 128 cores, 512GB RAM
  o Quad 10Gb Ethernet NICs
  o SSDs & Fusion-IO style storage -> 500k-800k+ IOPS from a single device
  o 2016: 128Gb Fibre Channel (4x32Gb) -> 25.6GB/s throughput
• I want to solve these types of problems as optimally as my hardware will allow
• Optimal hardware use necessitates things like:
  o One active thread per core
    • Any more results in unnecessary context switches
  o No unnecessary duplication of shared/common data in memory
  o Ability to saturate the bandwidth of my I/O devices
• And I want to do it all in Python
• …yet still be competitive against C/C++ where it matters
• So, for this talk:
  o What are my options today?
  o What might my options look like tomorrow?
• Concurrency:
  o Making progress on multiple things at the same time
    • Task A doesn't need to complete before you can start work on task B
  o Typically used to describe I/O-bound or I/O-driven systems, especially network-oriented socket servers
• Parallelism:
  o Making progress on one thing in multiple places at the same time
    • Task A is split into 8 parts, each part runs on a separate core
  o Typically used in compute-bound contexts
    • Map/reduce, aggregation, "embarrassingly parallelizable" data, etc.
• Concurrency: how many things did I do?
  o Things = units of work (e.g. servicing network clients)
  o Performance benchmark:
    • How fast was everyone served? (i.e. request latency)
    • And were they served fairly?
• Parallelism: how many things did I do them on?
  o Things = hardware units (e.g. CPU cores, GPU cores)
  o Performance benchmark:
    • How much did I get done?
    • How long did it take?
• There are some pretty decent Python libraries out there geared toward concurrency
  o Twisted, Tornado, Tulip/asyncio (3.x), etc.
• Common themes:
  o Set all your sockets and file descriptors to non-blocking
  o Write your Python in an event-oriented fashion (see the sketch below)
    • def data_received(self, data): …
    • Hollywood Principle: don't call us, we'll call you
  o Appearance of asynchronous I/O achieved via a single-threaded event loop with a multiplexing system call
• Biggest drawback:
  o Inherently limited to a single core
  o Thus, inadequate for problems that are both concurrent and computationally bound
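As a concrete illustration of that event-oriented style, here is a hedged sketch using the stdlib asyncio Protocol API (the echo behaviour and port are hypothetical, not from the talk):

    import asyncio

    class Echo(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport      # the framework calls us when a client connects

        def data_received(self, data):      # ...and again whenever data arrives
            self.transport.write(data)      # echo it straight back

    async def main():
        loop = asyncio.get_running_loop()
        server = await loop.create_server(Echo, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()    # single-threaded event loop underneath

    asyncio.run(main())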
• Coarse-grained (task parallelism):
  o Processing daily files
  o Data mining distinct segments/chunks/partitions
  o Process A runs on X data set, independent of process B running on Y data set
• Fine-grained (data parallelism):
  o Map/reduce, divide & conquer, aggregation, etc.
  o Common theme: sequential execution, fan out to parallel work against a shared data set, collapse back down to sequential
• Coarse-grained (multiple processes):
  o Typically adequate: using multiple processes that don't need to talk to each other (or, if they do, don't need to talk often)
  o Depending on shared state, could still benefit if implemented via threads instead of processes
    • Better cache usage, less duplication of identical memory structures, less overhead overall
• Fine-grained (multiple threads):
  o Typically optimal: using multiple threads within the same address space
  o IPC overhead can severely impact net performance when having to use processes instead of threads
• The GIL (global interpreter lock) prevents more than one Python interpreter thread from running at a given time
• If you want to use multiple threads within the same Python process, you have to come up with a way to avoid the GIL
  o (Fine-grained parallelism =~ multithreading)
• Today, this relies on:
  o Extension modules or libraries
  o Bypassing the CPython interpreter entirely and compiling to machine code
• Options today:
  o Extension modules or libraries:
    • Accelerate/NumbaPro (GPU, multicore)
    • OpenCV
    • Intel MKL libraries
  o Bypassing the CPython interpreter entirely by compiling Python to machine code (see the sketch below):
    • Numba with threading
    • Cython with OpenMP
• Options tomorrow (Python 4.x):
  o PyParallel?
    • Demonstrates it is possible to have multiple CPython interpreter threads running in parallel without incurring a performance overhead
  o PyPy-STM?
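As a rough illustration of the "compile to machine code, sidestep the GIL" route, here is a hedged sketch using Numba's parallel JIT (assumes numba and numpy are installed; the column-sum kernel is illustrative, not from the talk):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)                  # compiled to machine code, no interpreter in the loop
    def column_sums(a):
        out = np.zeros(a.shape[1])
        for j in prange(a.shape[1]):      # iterations are spread across worker threads
            s = 0.0
            for i in range(a.shape[0]):
                s += a[i, j]
            out[j] = s
        return out

    data = np.random.rand(10_000, 256)
    print(column_sums(data)[:4])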
• Plenty of third-party options for throwing multiple Python processes at your problem:
  o https://wiki.python.org/moin/ParallelProcessing
  o batchlib, Celery, Deap, disco, dispy, DistributedPython, exec_proxy, execnet, IPython Parallel, jug, mpi4py, PaPy, pyMPI, pypar, pypvm, Pyro, rthread, SCOOP, seppo, superspy
• Python stdlib options:
  o multiprocessing (since 2.6)
  o concurrent.futures (introduced in 3.2, backported to 2.7); see the sketch below
• Common throughout:
  o Separate Python processes to achieve parallel execution
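For reference, a hedged sketch of the stdlib route via concurrent.futures, which farms work out to separate processes (the work function and inputs are illustrative, not from the talk):

    from concurrent.futures import ProcessPoolExecutor
    import os

    def crunch(n):
        return sum(i * i for i in range(n))      # stand-in for compute-bound work

    if __name__ == "__main__":                   # required on Windows (spawn start method)
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            results = list(pool.map(crunch, [10**6] * 8))   # one task per worker process
        print(results[:2])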
• No talk on parallelism and concurrency in Python would be complete without mentioning the GIL (global interpreter lock)
• What is it?
  o A lock that ensures only one thread can execute CPython innards at any given time
  o Create 100 threading.Thread() instances…
  o …and only one will run at any given time
• So why even support threads if they can't run in parallel?
• Because they can be useful for blocking, I/O-bound problems (see the sketch below)
  o Ironically, they facilitate concurrency in Python, not parallelism
• But they won't solve your compute-bound problem any faster
• Nor will you ever exploit more than one core
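A hedged sketch of the one case where threads do pay off under the GIL: overlapping blocking I/O waits (the URLs are placeholders, not from the talk):

    import threading
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:    # blocking I/O releases the GIL
            print(url, len(resp.read()))

    urls = ["https://www.python.org", "https://www.example.com"]
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()        # the waits overlap (concurrency), but compute never uses >1 core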
• multiprocessing doesn't solve my problem
• Inadequate for fine-grained parallelism
• Inadequate for I/O-driven problems (specifically socket servers)
• Overhead of extra processes
• No shared memory out of the box (I'd have to set it up myself)
• Kinda quirky on Windows
• The examples in the docs are trivialized and don't really map to real-world problems
  o https://docs.python.org/2/library/multiprocessing.html
  o i.e. x*x for x in [1, 2, 3, 4] (roughly the shape shown below)
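The kind of documentation example being criticized looks roughly like this (paraphrased from the multiprocessing docs, not copied from the talk):

    from multiprocessing import Pool

    def f(x):
        return x * x                        # trivially parallel: no I/O, no shared state

    if __name__ == "__main__":
        with Pool(4) as p:
            print(p.map(f, [1, 2, 3, 4]))   # [1, 4, 9, 16]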
• Say I've got that contemporary box: 128 cores, 512GB RAM
• I want to use multiprocessing to solve my compute-bound problem
• And I want to optimally use my hardware; idle cores are useless
• So how big should my multiprocessing pool be? How many processes?
• 128, right?
• Except compute-bound problems invariably also need to do I/O
• And you're probably going to be doing blocking I/O
  o i.e. synchronous read/write calls
  o Non-blocking I/O is poorly suited to multiprocessing, as you'd need per-process event loops doing the syscall multiplexing dance
• The problem is, as soon as you block, that's one less process able to do useful work
• Can quickly become pathological:
  o Start a pool of 64 processes (for 64 cores)
  o A few minutes later: only 20-25 active
• Is the solution to create a bigger pool?
• No: increasing the number of processes isn't the solution
• It results in pathological behavior at the opposite end of the spectrum
  o Instead of idle cores, you have over-scheduled cores
  o Significant overhead incurred by context switching
    • Cache pollution, TLB contention
  o You can visibly see this with basic tools like top: 20% user, 80% sys
• Neither approach is optimal today:
  o processes <= ncpu: idle cores
  o processes > ncpu: over-scheduled cores
• We want to solve our problems optimally on our powerful hardware
• Avoid the sub-optimal:
  o Idleness
  o Blocking I/O
  o Context switching
  o Wasteful memory use
• Encourage the optimal:
  o One active thread per core
  o Efficient memory use
• Doing this well is a complex problem
• It is intrinsically dependent upon the I/O facilities provided by the OS:
  o Readiness-oriented or completion-oriented?
  o Thread-agnostic I/O or thread-specific I/O?
• Plus one critical element:
  o Disassociating the work (computation) from the worker (thread)
  o Associating a desired concurrency level (i.e. use all my cores) with the work
• This allows the kernel to make intelligent thread dispatching decisions
  o Ensures only one active thread per core
  o No over-scheduling or unnecessary context switches
• Turns out Windows is really good at all this
• I/O Completion Ports: the I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:
      do { s = GQCS(i); process(s); } while (1);    (one such loop per worker thread)
      GQCS = GetQueuedCompletionStatus()
• [Diagram: NIC -> IRP -> I/O Manager -> completion packet -> IOCP -> worker threads]
• Create an IOCP with maximum concurrency set to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (i.e. file I/O)
• Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• [Diagram: four threads, each looping do { s = GQCS(i); process(s); } while (1); against an IOCP created with concurrency=2]
• This mindset and approach to asynchronous I/O is very well suited to what I wanted to do with PyParallel
• Vista introduced new thread pool APIs
• Tightly integrated into the IOCP/overlapped ecosystem
• Greatly reduced the amount of scaffolding code I needed to write to prototype the concept:
      void PxSocketClient_Callback();
      CreateThreadpoolIo(.., &PxSocketClient_Callback)
      ..
      StartThreadpoolIo(..)
      AcceptEx(..)/WSASend(..)/WSARecv(..)
• That's it. When the async I/O op completes, your callback gets invoked
• Windows manages everything: optimal thread pool size, NUMA-cognizant dispatching
• Didn't need to create a single thread, no mutexes, none of the normal headaches that come with multithreading
• So now I can exploit my hardware
• But it's still Python…
• …and Python can be kinda slow
• Especially when doing computationally-intensive work
  o Especially especially when doing numerically-oriented computation
• Enter… Numba!
• Numba with PyParallel support
• @jit hooks introduced in the stdlib
• …and an API for multiple downstream JITs to hook into
  o CPython broadcasts the AST/bytecode being executed in ceval to the JITs
  o Multiple JITs running in separate threads
  o CPython: "Hey, can you optimize this chunk of Python? Let me know."
  o Next time it encounters that chunk, it can check for optimized versions
• Could provide a viable way of hooking in Numba, PyPy, Pythran, ShedSkin, etc., whilst still staying within the confines of CPython
• Windows was designed around asynchronous I/O
• Threads have been first-class citizens since day 1 (not bolted on as an afterthought)
• Designed to be programmed in a completion-oriented, multi-threaded fashion
• Overlapped I/O + IOCP + threads + kernel synchronization primitives = excellent combo for achieving high performance
• The best way to understand IOCP is to understand the problem it was designed to solve:
  o Facilitate writing high-performance network/file servers (HTTP, database, file server)
  o Extract maximum performance from multi-processor/multi-core hardware
  o (Which necessitates optimal resource usage)
• Maximum performance:
  o A thread running on every core servicing a client request
  o Upon finishing a client request, immediately process the next request if one is waiting
  o Never block
  o (And if you do block, handle it as optimally as possible)
• Optimal resource usage:
  o One active thread per core
• UNIX approach:
  o Set the file descriptor to non-blocking
  o Try to read or write data
  o Get EAGAIN instead of blocking
  o Try again later (see the sketch below)
• Windows approach:
  o Create an overlapped I/O structure
  o Issue a read or write, passing the overlapped structure and completion port info
  o The call returns immediately
  o The read/write is done asynchronously by the I/O manager
  o An optional completion packet is queued to the completion port a) on error, b) on completion
  o A thread waiting on the completion port de-queues the completion packet and processes the request
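For the UNIX side, a hedged Python sketch of what "get EAGAIN instead of blocking" looks like in practice (the hostname and buffer size are placeholders; EAGAIN surfaces in Python as BlockingIOError):

    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.setblocking(False)                 # switch the socket to non-blocking
    try:
        data = sock.recv(4096)              # nothing to read yet?
    except BlockingIOError:                 # EAGAIN/EWOULDBLOCK: try again later,
        data = b""                          # typically after select()/epoll says it's ready
    print(len(data))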
• UNIX approach:
  o Anything to write yet?
  o No? How about now?
  o Still no?
  o Now?
  o Yes!? Really? Ok, write it!
  o Hi! Me again. Anything to read?
  o No?
  o How about now?
• Windows approach:
  o Here, do this. Let me know when it's done.
• Readiness-oriented (reactor pattern) vs completion-oriented (proactor pattern)
• Windows has asynchronous ways to do just about everything
• Basically, if it could block, there's a way to do it asynchronously in Windows
• WSASend()/WSARecv() vs send()/recv()
• AcceptEx() vs accept()
• ConnectEx() vs connect()
• DisconnectEx() vs close()
• GetAddrinfoEx() vs getaddrinfo() (Windows 8+)
• (And that's just for sockets; all device I/O can be done asynchronously)
• Thread-agnostic I/O on Windows
• IOCPs allow IRP completion (copying data from non-paged kernel memory back to the user's buffer) to be deferred to a thread-agnostic queue
• Any thread can wait on this queue (the completion port) via GetQueuedCompletionStatus()
• IRP completion is done just before that call returns
• This allows the I/O manager to rapidly queue IRP completions
• …and waiting threads to instantly dequeue and process them
• IOCP and concurrency: the I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:
      do { s = GQCS(i); process(s); } while (1);    (one such loop per worker thread)
      GQCS = GetQueuedCompletionStatus()
• [Diagram: NIC -> IRP -> I/O Manager -> completion packet -> IOCP -> worker threads]
• Goals: maximize performance, optimize resource usage
• Optimal number of active threads running per core: 1
• Optimal number of total threads running: 1 * ncpu
• Windows can't control how many threads you create and then have waiting against the completion port
• But it can control when, and how many, threads get woken up
• …via the IOCP's maximum concurrency value
• (Specified when you create the IOCP; see the sketch below)
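As a rough, hedged sketch of what the maximum concurrency value means at the API level, here it is driven from Python via the third-party pywin32 bindings; this is an assumption of mine for illustration, not code from the talk (the underlying C parameter is CreateIoCompletionPort's NumberOfConcurrentThreads):

    import threading
    import time
    import win32event
    import win32file

    NCPU = 2
    # Last argument = the IOCP's maximum concurrency value: how many waiting
    # threads Windows will allow to be active at once.
    port = win32file.CreateIoCompletionPort(win32file.INVALID_HANDLE_VALUE, None, 0, NCPU)

    def worker():
        while True:
            rc, nbytes, key, overlapped = win32file.GetQueuedCompletionStatus(port, win32event.INFINITE)
            print("completion:", key, nbytes)   # the process(s) step from the slide's loop

    for _ in range(NCPU * 2):                   # create more threads than cores...
        threading.Thread(target=worker, daemon=True).start()

    # ...Windows still activates at most NCPU of them at a time.
    win32file.PostQueuedCompletionStatus(port, 0, 42, None)
    time.sleep(1)                               # give a worker a chance to dequeue it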
• Create an IOCP with maximum concurrency set to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (i.e. file I/O)
• Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• [Diagram: four threads, each looping do { s = GQCS(i); process(s); } while (1); against an IOCP created with concurrency=2]
• How PyParallel works:
  o No GIL removal
    • This was previously tried and rejected
    • Required fine-grained locking throughout the interpreter
    • Mutexes are expensive
    • Single-threaded execution became significantly slower
  o Not using PyPy's approach via Software Transactional Memory (STM)
    • Huge overhead
    • 64 threads trying to write to something: 1 wins and continues
    • 63 keep trying
    • 63 bottles of beer on the wall…
• Doesn't support "free threading"
  o Existing code using threading.Thread won't magically run on all cores
  o You need to use the new async APIs
• Keep the GIL: it serves a very useful purpose
• Instead, intercept all thread-sensitive calls:
  o Reference counting (Py_(INCREF|DECREF|CLEAR))
  o Memory management (PyMem_(Malloc|Free), PyObject_(INIT|NEW))
  o Free lists
  o Static C globals
  o Interned strings
• If we're the main thread, do what we normally do
• However, if we're a parallel thread, do a thread-safe alternative
• "If we're a parallel thread, do X; if not, do Y"
  o X = thread-safe alternative
  o Y = what we normally do
• "If we're a parallel thread"
  o Thread-sensitive calls are ubiquitous
  o But we want to have a negligible performance impact
  o So the challenge is: how quickly can we detect that we're a parallel thread?
  o The quicker we can detect it, the less overhead incurred
      #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())
• What's so special about _Py_get_current_thread_id()?
  o On Windows, you could use GetCurrentThreadId()
  o On POSIX, pthread_self()
• Unnecessary overhead (this macro will be everywhere)
• Is there a quicker way?
• Can we determine if we're running in a parallel context without needing a function call?
      +    { \
      +        if (Py_PXCTX) \
      +            _Px_ForgetReference(op); \
      +        else \
      +            _Py_INC_TPFREES(op); \
      +    } while (0)
      +
      +#endif /* WITH_PARALLEL */
• Py_PXCTX == (Py_MainThreadId == __readfsdword(0x48))
• Overhead reduced to a couple more instructions and an extra branch (the cost of which can be eliminated by branch prediction)
• That's basically free compared to STM or fine-grained locking
• Negligible overhead of Py_PXCTX for normal single-threaded code
  o GIL removal: 40% overhead
  o PyPy's STM: "200-500% slower"
• Only touches a relatively small amount of code
  o No need for intrusive surgery like rewriting a thread-safe bucket memory allocator or garbage collector
• Keeps GIL semantics
  o Important for legacy code
  o 3rd-party libraries, C extension code
• Code executing in a parallel context has full visibility of "main thread objects" (in a read-only capacity, thus no need for locks)
• In the demo coming up:
  o One python_d.exe process
  o Constant memory use
  o CPU use proportional to concurrent client count (1 client = 25% CPU use)
  o Every 10,000 sends, a status message is printed
• Depicts dynamically switching from synchronous sends to async sends
• Illustrates awareness of active I/O hogs
• Environment:
  o MacBook Pro, 8-core i7 2.2GHz, 8GB RAM
  o 1-5 netcat instances on OS X
  o Windows 7 instance running in Parallels, 4 cores, 3GB