Slide 1

Parallelism and Concurrency with Python
Trent Nelson
Managing Director, New York, Continuum Analytics
@ContinuumIO, @trentnelson
[email protected]
http://speakerdeck.com/trent/

Slide 2

About Me
• Systems Software Engineer
• Core Python Committer
• Apache/Subversion Committer
• Founded Snakebite @ Michigan State University
  o AIX RS/6000
  o SGI IRIX/MIPS
  o Alpha/Tru64
  o Solaris/SPARC
  o HP-UX/IA64
  o FreeBSD, NetBSD, OpenBSD, DragonFlyBSD
• Background is UNIX
• Made peace with Windows when XP came out
• Author of PyParallel

Slide 3

PyParallel
• Set of modifications to the CPython interpreter
• Allows multiple interpreter threads to run in parallel without incurring any additional performance penalties
• Intrinsically paired with Windows asynchronous I/O primitives (IOCP), Vista+ thread pools and kernel synchronization primitives
• Working prototype/proof-of-concept after 3 months:
  o Performant HTTP server, written in Python, automatically exploits all cores:
    pyparallel.exe -m async.http.server

Slide 4

Motivation behind PyParallel
• What problem was I trying to solve?
• Wasn't happy with the status quo
  o Parallel options (for compute-bound, data-parallelism problems):
    • The GIL prevents more than one interpreter thread from running at a time
    • …so you have to rely on separate Python processes if you want to exploit more than one core
  o Concurrency options (for I/O-bound or I/O-driven, task-parallelism problems):
    • One thread per client, blocking I/O
    • Single-threaded event loop around a multiplexing system call (select/poll/epoll/kqueue)

Slide 5

What if I'm I/O-bound and compute-bound?
• Contemporary enterprise problems:
  o Computationally-intensive (compute-bound) work against TBs/PBs of data (I/O-bound)
  o Serving tens of thousands of network clients (I/O-driven) with non-trivial computation required per request (compute-bound)
  o Serving fewer clients, but providing ultra-low latency or maximum throughput to those you do serve (HFT, remote array servers, etc.)
• Contemporary data center hardware:
  o 128 cores, 512GB RAM
  o Quad 10Gb Ethernet NICs
  o SSDs & Fusion-io-style storage -> 500k-800k+ IOPS from a single device
  o 2016: 128Gb Fibre Channel (4x32Gb) -> 25.6GB/s throughput

Slide 6

Real Problems, Powerful Hardware
• I want to solve my problems as optimally as my hardware will allow
• Optimal hardware use necessitates things like:
  o One active thread per core
    • Any more results in unnecessary context switches
  o No unnecessary duplication of shared/common data in memory
  o Ability to saturate the bandwidth of my I/O devices
• And I want to do it all in Python
• …yet still be competitive against C/C++ where it matters
• So for this talk:
  o What are my options today?
  o What might my options look like tomorrow?

Slide 7

First, some definitions…

Slide 8

Concurrency versus Parallelism
• Concurrency:
  o Making progress on multiple things at the same time
    • Task A doesn't need to complete before you can start work on task B
  o Typically used to describe I/O-bound or I/O-driven systems, especially network-oriented socket servers
• Parallelism:
  o Making progress on one thing in multiple places at the same time
    • Task A is split into 8 parts, each part runs on a separate core
  o Typically used in compute-bound contexts
    • Map/reduce, aggregation, "embarrassingly parallelizable" data, etc.

Slide 9

So for a given time frame T (1us, 1ms, 1s, etc.)…
• Concurrency: how many things did I do?
  o Things = units of work (e.g. servicing network clients)
  o Performance benchmark:
    • How fast was everyone served? (i.e. request latency)
    • And were they served fairly?
• Parallelism: how many things did I do them on?
  o Things = hardware units (e.g. CPU cores, GPU cores)
  o Performance benchmark:
    • How much did I get done?
    • How long did it take?

Slide 10

Concurrent Python

Slide 11

Concurrent Python
• I/O-driven client/server systems (socket-oriented)
• There are some pretty decent Python libraries out there geared toward concurrency
  o Twisted, Tornado, Tulip/asyncio (3.x), etc.
• Common themes:
  o Set all your sockets and file descriptors to non-blocking
  o Write your Python in an event-oriented fashion (see the sketch below)
    • def data_received(self, data): …
    • Hollywood Principle: don't call us, we'll call you
  o Appearance of asynchronous I/O achieved via a single-threaded event loop around a multiplexing system call
• Biggest drawback:
  o Inherently limited to a single core
  o Thus, inadequate for problems that are both concurrent and compute-bound
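To make the callback style above concrete, a minimal sketch against the stdlib asyncio (Tulip) Protocol API; EchoProtocol, the address and the port are illustrative, not from the talk:

    import asyncio

    class EchoProtocol(asyncio.Protocol):
        def connection_made(self, transport):
            # Called once per client; the transport wraps a non-blocking socket.
            self.transport = transport

        def data_received(self, data):
            # "Don't call us, we'll call you": the event loop invokes this
            # whenever the multiplexing syscall reports the socket is readable.
            self.transport.write(data)

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(loop.create_server(EchoProtocol, '127.0.0.1', 8888))
    try:
        loop.run_forever()   # one thread, one event loop, one core
    finally:
        server.close()
        loop.close()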

Slide 12

Parallel Python

Slide 13

Coarse-grained versus fine-grained parallelism
• Coarse-grained (task parallelism)
  o Batch processing daily files
  o Data mining distinct segments/chunks/partitions
  o Process A runs on data set X, independent of process B running on data set Y
• Fine-grained (data parallelism)
  o Map/reduce, divide & conquer, aggregation, etc.
  o Common theme: sequential execution, fan out to parallel work against a shared data set, collapse back down to sequential

Slide 14

Coarse-grained versus fine-grained parallelism
• Coarse-grained (multiple processes):
  o Typically adequate: using multiple processes that don't need to talk to each other (or if they do, don't need to talk often)
  o Depending on shared state, could still benefit if implemented via threads instead of processes
    • Better cache usage, less duplication of identical memory structures, less overhead overall
• Fine-grained (multiple threads):
  o Typically optimal: using multiple threads within the same address space
  o IPC overhead can severely impact net performance when having to use processes instead of threads

Slide 15

Python landscape for fine-grained parallelism
• Python's GIL (global interpreter lock) prevents more than one Python interpreter thread from running at any given time
• If you want to use multiple threads within the same Python process, you have to come up with a way to avoid the GIL
  o (Fine-grained parallelism =~ multithreading)
• Today, this relies on:
  o Extension modules or libraries
  o Bypassing the CPython interpreter entirely and compiling to machine code

Slide 16

Python landscape for fine-grained parallelism
• Options today:
  o Extension modules or libraries:
    • Accelerate/NumbaPro (GPU, multicore)
    • OpenCV
    • Intel MKL libraries
  o Bypassing the CPython interpreter entirely by compiling Python to machine code:
    • Numba with threading (see the sketch below)
    • Cython with OpenMP
• Options tomorrow (Python 4.x):
  o PyParallel?
    • Demonstrates it is possible to have multiple CPython interpreter threads running in parallel without incurring a performance overhead
  o PyPy-STM?
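A minimal sketch of the "Numba with threading" option, assuming Numba's @njit(parallel=True) and prange interface; scaled_sum() and the array are illustrative, not from the talk:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def scaled_sum(a):
        # Compiled to machine code; the prange loop is fanned out across
        # multiple threads/cores with no GIL in the way.
        total = 0.0
        for i in prange(a.shape[0]):
            total += a[i] * 2.0
        return total

    x = np.arange(10000000, dtype=np.float64)
    print(scaled_sum(x))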

Slide 17

Python Landscape for Coarse-grained Parallelism
• Rich ecosystem depending on your problem:
  o https://wiki.python.org/moin/ParallelProcessing
  o batchlib, Celery, Deap, disco, dispy, DistributedPython, exec_proxy, execnet, IPython Parallel, jug, mpi4py, PaPy, pyMPI, pypar, pypvm, Pyro, rthread, SCOOP, seppo, superspy
• Python stdlib options:
  o multiprocessing (since 2.6)
  o concurrent.futures (introduced in 3.2, backported to 2.7; see the sketch below)
• Common throughout:
  o Separate Python processes to achieve parallel execution
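As an illustration of the stdlib options named above, a hedged sketch using concurrent.futures to drive a pool of separate worker processes; crunch() and the chunk sizes are placeholders, not the talk's code:

    from concurrent.futures import ProcessPoolExecutor

    def crunch(chunk):
        # Stand-in for real compute-bound work on one partition of the data.
        return sum(x * x for x in chunk)

    if __name__ == '__main__':
        chunks = [range(i, i + 1000000) for i in range(0, 8000000, 1000000)]
        with ProcessPoolExecutor() as pool:   # defaults to one worker per core
            print(sum(pool.map(crunch, chunks)))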

Slide 18

Python & the GIL
• No talk on parallelism and concurrency in Python would be complete without mentioning the GIL (global interpreter lock)
• What is it?
  o A lock that ensures only one thread can execute CPython innards at any given time
  o Create 100 threading.Thread() instances…
  o …and only one will run at any given time
• So why even support threads if they can't run in parallel?
• Because they can be useful for blocking, I/O-bound problems (see the sketch below)
  o Ironically, they facilitate concurrency in Python, not parallelism
• But they won't solve your compute-bound problem any faster
• Nor will you ever exploit more than one core
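A hedged illustration (not from the talk) of why GIL-bound threads still help with blocking I/O: the GIL is released while a thread waits on the network, so the other threads make progress. The URL and thread count are arbitrary:

    import threading
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:   # blocking I/O releases the GIL
            print(url, len(resp.read()), 'bytes')

    threads = [threading.Thread(target=fetch, args=('http://example.com',)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()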

Slide 19

Exploiting multiple cores for compute-bound problems…

Slide 20

import multiprocessing
• Added in Python 2.6 (2008)
• Similar interface to the threading module
• Uses separate Python processes behind the scenes (see the sketch below)
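A minimal sketch of the module in use, assuming a trivial square() worker (not from the talk); Pool() defaults to one worker process per core:

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':        # required on Windows, which spawns fresh interpreters
        with Pool() as pool:
            print(pool.map(square, range(10)))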

Slide 21

from multiprocessing import pros
• It works
• It's in the stdlib
• It's adequate for coarse-grained parallelism
• It'll use all my cores if I'm compute-bound

Slide 22

from multiprocessing import cons
• Often sub-optimal depending on the problem
• Inadequate for fine-grained parallelism
• Inadequate for I/O-driven problems (specifically socket servers)
• Overhead of extra processes
• No shared memory out of the box (I'd have to set it up myself)
• Kinda quirky on Windows
• The examples in the docs are trivialized and don't really map to real-world problems
  o https://docs.python.org/2/library/multiprocessing.html
  o i.e. x*x for x in [1, 2, 3, 4]

Slide 23

from multiprocessing import subtleties
• Recap: contemporary data center hardware: 128 cores, 512GB RAM
• I want to use multiprocessing to solve my compute-bound problem
• And I want to use my hardware optimally; idle cores are useless
• So how big should my multiprocessing pool be? How many processes?
• 128, right?

Slide 24

128 cores = 128 processes?
• Works fine… until you need to do I/O
• And you're probably going to be doing blocking I/O
  o i.e. synchronous read/write calls
  o Non-blocking I/O is poorly suited to multiprocessing, as you'd need per-process event loops doing the syscall-multiplexing dance
• The problem is, as soon as you block, that's one less process able to do useful work
• Can quickly become pathological:
  o Start a pool of 64 processes (for 64 cores)
  o A few minutes later: only 20-25 active
• Is the solution to create a bigger pool?

Slide 25

128 cores = 132 processes? 194? 256?
• Simply increasing the number of processes isn't the solution
• Results in pathological behavior at the opposite end of the spectrum
  o Instead of idle cores, you have over-scheduled cores
  o Significant overhead incurred by context switching
    • Cache pollution, TLB contention
  o You can see this with basic tools like top: 20% user, 80% sys
• Neither approach is optimal today:
  o processes <= ncpu: idle cores
  o processes > ncpu: over-scheduled cores

Slide 26

What do we really need?
• We want to solve our problems optimally on our powerful hardware
• Avoid the sub-optimal:
  o Idleness
  o Blocking I/O
  o Context switching
  o Wasteful memory use
• Encourage the optimal:
  o One active thread per core
  o Efficient memory use

Slide 27

One active thread per core
• This is a subtly complex problem
• Intrinsically dependent upon the I/O facilities provided by the OS:
  o Readiness-oriented or completion-oriented?
  o Thread-agnostic I/O or thread-specific I/O?
• Plus one critical element:
  o Disassociating the work (computation) from the worker (thread)
  o Associating a desired concurrency level (i.e. use all my cores) with the work
• This allows the kernel to make intelligent thread-dispatching decisions
  o Ensures only one active thread per core
  o No over-scheduling or unnecessary context switches
• It turns out Windows is really good at all of this

Slide 28

I/O Completion Ports
• IOCPs can be thought of as FIFO queues
• I/O manager pushes completion packets asynchronously
• Threads pop completions off and process results; each worker thread runs:

    do {
        s = GQCS(i);     /* GQCS = GetQueuedCompletionStatus() */
        process(s);
    } while (1);

• (Diagram: NIC -> IRP -> I/O manager -> completion packet -> IOCP -> worker threads)

Slide 29

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 30

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the maximum concurrency (2) and that there are still outstanding packets in the completion queue
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 31

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the maximum concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 32

Windows and PyParallel
• The Windows concurrency and synchronization primitives and approach to asynchronous I/O are very well suited to what I wanted to do with PyParallel
• Vista introduced new thread pool APIs
• Tightly integrated into the IOCP/overlapped ecosystem
• Greatly reduces the amount of scaffolding code I needed to write to prototype the concept:

    void PxSocketClient_Callback();
    CreateThreadpoolIo(.., &PxSocketClient_Callback)
    ..
    StartThreadpoolIo(..)
    AcceptEx(..)/WSASend(..)/WSARecv(..)

• That's it. When the async I/O op completes, your callback gets invoked
• Windows manages everything: optimal thread pool size, NUMA-cognizant dispatching
• Didn't need to create a single thread, no mutexes, none of the normal headaches that come with multithreading

Slide 33

Post-PyParallel
• I now have the Python glue to optimally exploit my hardware
• But it's still Python…
• …and Python can be kinda slow
• Especially when doing computationally-intensive work
  o Especially especially when doing numerically-oriented computation
• Enter… Numba!

Slide 34

Numba & JIT'ing via @decorators
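To ground the decorator pattern the next slides benchmark, a hedged sketch with Numba's @jit; the 3x3 mean filter here is illustrative, not the talk's actual image-filtering code:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def mean_filter(image, out):
        # Plain nested Python loops, compiled to machine code by Numba.
        for i in range(1, image.shape[0] - 1):
            for j in range(1, image.shape[1] - 1):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        acc += image[i + di, j + dj]
                out[i, j] = acc / 9.0

    img = np.random.rand(512, 512)
    result = np.zeros_like(img)
    mean_filter(img, result)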

Slide 35

Image Filtering

Slide 36

Image Filtering: ~1500x Speedup

Slide 37

NumbaPro GPU Support

Slide 38

NumbaPro: Black-Scholes Speedup

Slide 39

Final Thoughts…

Slide 40

What do I want in Python 4.x?
• Native cross-platform PyParallel support
• @jit hooks introduced in the stdlib
• …and an API for multiple downstream JITs to hook into
  o CPython broadcasts the AST/bytecode being executed in ceval to the JITs
  o Multiple JITs running in separate threads
  o CPython: "Hey, can you optimize this chunk of Python? Let me know."
  o Next time it encounters that chunk, it can check for optimized versions
• Could provide a viable way of hooking in Numba, PyPy, Pythran, Shed Skin, etc., whilst still staying within the confines of CPython

Slide 41

Thanks!
@ContinuumIO, @trentnelson
[email protected]
http://speakerdeck.com/trent/

Slide 42

(Backup slides)

Slide 43

I/O on Contemporary Windows Kernels (Vista+)
• Fantastic support for asynchronous I/O
• Threads have been first-class citizens since day 1 (not bolted on as an afterthought)
• Designed to be programmed in a completion-oriented, multi-threaded fashion
• Overlapped I/O + IOCP + threads + kernel synchronization primitives = excellent combo for achieving high performance

Slide 44

I/O Completion Ports
• The best way to grok IOCP is to understand the problem it was designed to solve:
  o Facilitate writing high-performance network/file servers (http, database, file server)
  o Extract maximum performance from multi-processor/multi-core hardware
  o (Which necessitates optimal resource usage)

Slide 45

IOCP: Goals
• Extract maximum performance through parallelism
  o Thread running on every core servicing a client request
  o Upon finishing a client request, immediately processes the next request if one is waiting
  o Never block
  o (And if you do block, handle it as optimally as possible)
• Optimal resource usage
  o One active thread per core

Slide 46

On not blocking...
• UNIX approach:
  o Set file descriptor to non-blocking
  o Try to read or write data
  o Get EAGAIN instead of blocking
  o Try again later (see the Python sketch below)
• Windows approach:
  o Create an overlapped I/O structure
  o Issue a read or write, passing the overlapped structure and completion port info
  o Call returns immediately
  o Read/write done asynchronously by the I/O manager
  o Optional completion packet queued to the completion port a) on error, b) on completion
  o Thread waiting on the completion port de-queues the completion packet and processes the request
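A hedged Python sketch of that UNIX readiness-oriented dance; the host, port and request bytes are placeholders:

    import select
    import socket

    sock = socket.create_connection(('example.com', 80))
    sock.sendall(b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')
    sock.setblocking(False)                  # reads now raise EAGAIN instead of blocking

    chunks = []
    while True:
        select.select([sock], [], [])        # "anything to read yet?"
        try:
            data = sock.recv(4096)
        except BlockingIOError:              # the EAGAIN case: not ready after all, ask again
            continue
        if not data:                         # peer closed the connection
            break
        chunks.append(data)
    print(len(b''.join(chunks)), 'bytes read')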

Slide 47

On not blocking...
• UNIX approach (readiness-oriented, the reactor pattern):
  o Is this ready to write yet?
  o No? How about now?
  o Still no?
  o Now?
  o Yes!? Really? OK, write it!
  o Hi! Me again. Anything to read?
  o No?
  o How about now?
• Windows approach (completion-oriented, the proactor pattern; see the sketch below):
  o Here, do this. Let me know when it's done.
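As a hedged Python-level aside (not from the talk): on Windows, asyncio can run in this completion-oriented style via its IOCP-backed proactor event loop:

    import sys
    import asyncio

    if sys.platform == 'win32':
        loop = asyncio.ProactorEventLoop()   # completion-oriented, built on IOCP
        asyncio.set_event_loop(loop)
    else:
        loop = asyncio.get_event_loop()      # default readiness-oriented (selector) loop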

Slide 48

On not blocking...
• Windows provides an asynchronous/overlapped way to do just about everything
• Basically, if it could block, there's a way to do it asynchronously in Windows
• WSASend() and WSARecv()
• AcceptEx() vs accept()
• ConnectEx() vs connect()
• DisconnectEx() vs close()
• GetAddrInfoEx() vs getaddrinfo() (Windows 8+)
• (And that's just for sockets; all device I/O can be done asynchronously)

Slide 49

Thread-agnostic I/O with IOCP
• Secret sauce behind asynchronous I/O on Windows
• IOCPs allow IRP completion (copying data from nonpaged kernel memory back to the user's buffer) to be deferred to a thread-agnostic queue
• Any thread can wait on this queue (completion port) via GetQueuedCompletionStatus()
• IRP completion done just before that call returns
• Allows the I/O manager to rapidly queue IRP completions
• …and waiting threads to instantly dequeue and process

Slide 50

IOCP and Concurrency
• IOCPs can be thought of as FIFO queues
• I/O manager pushes completion packets asynchronously
• Threads pop completions off and process results; each worker thread runs:

    do {
        s = GQCS(i);     /* GQCS = GetQueuedCompletionStatus() */
        process(s);
    } while (1);

• (Diagram: NIC -> IRP -> I/O manager -> completion packet -> IOCP -> worker threads)

Slide 51

IOCP and Concurrency
• Remember the IOCP design goals:
  o Maximize performance
  o Optimize resource usage
• Optimal number of active threads running per core: 1
• Optimal number of total threads running: 1 * ncpu
• Windows can't control how many threads you create and then have waiting against the completion port
• But it can control when and how many threads get awoken
  o …via the IOCP's maximum concurrency value
  o (Specified when you create the IOCP)

Slide 52

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 53

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the maximum concurrency (2) and that there are still outstanding packets in the completion queue
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 54

IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the maximum concurrency (2) and that there are still outstanding packets in the completion queue
• …and schedules another thread to run
• (Diagram: four GQCS worker loops servicing one IOCP with concurrency=2)

Slide 55

Removing the GIL (Without needing to remove the GIL.)

Slide 56

So how does it work?
• First, how it doesn't work:
  o No GIL removal
    • This was previously tried and rejected
    • Required fine-grained locking throughout the interpreter
    • Mutexes are expensive
    • Single-threaded execution significantly slower
  o Not using PyPy's approach via Software Transactional Memory (STM)
    • Huge overhead
    • 64 threads trying to write to something: 1 wins, continues
    • 63 keep trying
    • 63 bottles of beer on the wall…
  o Doesn't support "free threading"
    • Existing code using threading.Thread won't magically run on all cores
    • You need to use the new async APIs

Slide 57

PyParallel's Approach
• Don't touch the GIL
  o It's great, serves a very useful purpose
• Instead, intercept all thread-sensitive calls:
  o Reference counting (Py_(INCREF|DECREF|CLEAR))
  o Memory management (PyMem_(Malloc|Free), PyObject_(INIT|NEW))
  o Free lists
  o Static C globals
  o Interned strings
• If we're the main thread, do what we normally do
• However, if we're a parallel thread, do a thread-safe alternative

Slide 58

Main Thread or Parallel Thread?
• "If we're a parallel thread, do X; if not, do Y"
  o X = thread-safe alternative
  o Y = what we normally do
• "If we're a parallel thread"
  o Thread-sensitive calls are ubiquitous
  o But we want to have a negligible performance impact
  o So the challenge is how quickly we can detect that we're a parallel thread
  o The quicker we can detect it, the less overhead incurred

Slide 59

The Py_PXCTX macro
• "Are we running in a parallel context?"

    #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())

• What's so special about _Py_get_current_thread_id()?
  o On Windows, you could use GetCurrentThreadId()
  o On POSIX, pthread_self()
• Unnecessary overhead (this macro will be everywhere)
• Is there a quicker way?
• Can we determine if we're running in a parallel context without needing a function call?

Slide 60

Windows Solution: Interrogate the TEB

    #ifdef WITH_INTRINSICS
    #  ifdef MS_WINDOWS
    #    include <intrin.h>
    #    if defined(MS_WIN64)
    #      pragma intrinsic(__readgsdword)
    #      define _Py_get_current_process_id()  (__readgsdword(0x40))
    #      define _Py_get_current_thread_id()   (__readgsdword(0x48))
    #    elif defined(MS_WIN32)
    #      pragma intrinsic(__readfsdword)
    #      define _Py_get_current_process_id()  __readfsdword(0x20)
    #      define _Py_get_current_thread_id()   __readfsdword(0x24)

Slide 61

Py_PXCTX Example

    -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
    +#define _Py_ForgetReference(op) \
    +    do { \
    +        if (Py_PXCTX) \
    +            _Px_ForgetReference(op); \
    +        else \
    +            _Py_INC_TPFREES(op); \
    +    } while (0)
    +
    +#endif /* WITH_PARALLEL */

• Py_PXCTX boils down to (Py_MainThreadId != __readgsdword(0x48)): one TEB read and a compare
• Overhead reduced to a couple more instructions and an extra branch (the cost of which can be eliminated by branch prediction)
• That's basically free compared to STM or fine-grained locking

Slide 62

PyParallel Advantages
• Initial profiling results: 0.01% overhead incurred by Py_PXCTX for normal single-threaded code
  o GIL removal: 40% overhead
  o PyPy's STM: "200-500% slower"
• Only touches a relatively small amount of code
  o No need for intrusive surgery like rewriting a thread-safe bucket memory allocator or garbage collector
• Keeps GIL semantics
  o Important for legacy code
  o 3rd-party libraries, C extension code
• Code executing in a parallel context has full visibility of "main thread objects" (in a read-only capacity, thus no need for locks)

Slide 63

PyParallel In Action
• Things to note in the chargen demo coming up:
  o One python_d.exe process
  o Constant memory use
  o CPU use proportional to concurrent client count (1 client = 25% CPU use)
  o Every 10,000 sends, a status message is printed
    • Depicts dynamically switching from synchronous sends to async sends
    • Illustrates awareness of active I/O hogs
• Environment:
  o MacBook Pro, 8-core i7 2.2GHz, 8GB RAM
  o 1-5 netcat instances on OS X
  o Windows 7 instance running in Parallels, 4 cores, 3GB

Slide 64

1 Chargen (Num. Processes / CPU% / Mem% = 99 / 25% / 67%)

Slide 65

2 Chargen (99/54%/67%)

Slide 66

3 Chargen (99/77%/67%)

Slide 67

4 Chargen (99/99%/68%)

Slide 68

5 Chargen?! (99/99%/67%)

Slide 69

Thanks!
Follow us on Twitter for more PyParallel announcements!
@ContinuumIO
@trentnelson
http://continuum.io/