
Parallelism and Concurrency with Python

Presented May 2014.

For more information on PyParallel, see: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores (recorded here: http://vimeo.com/79539317).

www.continuum.io
@trentnelson


Trent Nelson

May 18, 2014


Transcript

  1. Parallelism and Concurrency with Python Trent Nelson Managing Director, New

    York Continuum Analytics @ContinuumIO, @trentnelson trent.nelson@continuum.io http://speakerdeck.com/trent/
  2. About Me • Systems Software Engineer • Core Python Committer

    • Apache/Subversion Committer • Founded Snakebite @ Michigan State University o AIX RS/6000 o SGI IRIX/MIPS o Alpha/Tru64 o Solaris/SPARC o HP-UX/IA64 o FreeBSD, NetBSD, OpenBSD, DragonFlyBSD • Background is UNIX • Made peace with Windows when XP came out • Author of PyParallel
  3. PyParallel • Set of modifications to CPython interpreter • Allows

    multiple interpreter threads to run in parallel without incurring any additional performance penalties • Intrinsically paired with Windows asynchronous I/O primitives (IOCP), Vista+ thread pools and kernel synchronization primitives • Working prototype/proof-of-concept after 3 months: o Performant HTTP server, written in Python, automatically exploits all cores: pyparallel.exe -m async.http.server
  4. Motivation behind PyParallel • What problem was I trying to

    solve? • Wasn’t happy with the status quo o Parallel options (for compute-bound, data parallelism problems): • GIL prevents simultaneous multithreading • ….so you have to rely on separate Python processes if you want to exploit more than one core o Concurrency options (for I/O-bound or I/O-driven, task parallelism problems): • One thread per client, blocking I/O • Single-thread, event loop, multiplexing system call (select/poll/epoll/kqueue)
  5. What if I’m I/O-bound and compute-bound? • Contemporary enterprise problems:

    o Computationally-intensive (compute-bound) work against TBs/PBs of data (I/O-bound) o Serving tens of thousands of network clients (I/O-driven) with non-trivial computation required per request (compute-bound) o Serving fewer clients, but providing ultra-low latency or maximum throughput to those you do serve (HFT, remote array servers, etc) • Contemporary data center hardware: o 128 cores, 512GB RAM o Quad 10Gb Ethernet NICs o SSDs & Fusion-IO style storage -> 500k-800k+ IOPS from a single device o 2016: 128Gb Fibre Channel (4x32Gb) -> 25.6GB/s throughput
  6. Real Problems, Powerful Hardware • I want to solve my

    problems as optimally as my hardware will allow • Optimal hardware use necessitates things like: o One active thread per core • Any more results in unnecessary context switches o No unnecessary duplication of shared/common data in memory o Ability to saturate the bandwidth of my I/O devices • And I want to do it all in Python • ....yet still be competitive against C/C++ where it matters • So for this talk: o What are my options today? o What might my options look like tomorrow?
  7. First, some definitions…

  8. Concurrency versus Parallelism • Concurrency: o Making progress on multiple

    things at the same time • Task A doesn’t need to complete before you can start work on task B o Typically used to describe I/O-bound or I/O-driven systems, especially network- oriented socket servers • Parallelism: o Making progress on one thing in multiple places at the same time • Task A is split into 8 parts, each part runs on a separate core o Typically used in compute-bound contexts • Map/reduce, aggregation, “embarrassingly parallelizable” data etc
  9. So for a given time frame T (1us, 1ms, 1s

    etc)… • Concurrency: how many things did I do? o Things = units of work (e.g. servicing network clients) o Performance benchmark: • How fast was everyone served? (i.e. request latency) • And were they served fairly? • Parallelism: how many things did I do them on? o Things = hardware units (e.g. CPU cores, GPU cores) o Performance benchmark: • How much did I get done? • How long did it take?
  10. Concurrent Python

  11. Concurrent Python • I/O-driven client/server systems (socket-oriented) • There are

    some pretty decent Python libraries out there geared toward concurrency o Twisted, Tornado, Tulip/asyncio (3.x), etc • Common themes: o Set all your sockets and file descriptors to non-blocking o Write your Python in an event-oriented fashion • def data_received(self, data): … • Hollywood Principle: don’t call us, we’ll call you o Appearance of asynchronous I/O achieved via single-threaded event loop with multiplexing system call • Biggest drawback: o Inherently limited to a single core o Thus, inadequate for problems that are both concurrent and computationally bound
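The event-oriented style described above (a data_received callback driven by a single-threaded event loop) can be sketched with the stdlib's asyncio; a minimal illustrative echo server, not code from the talk:

```python
import asyncio

class EchoProtocol(asyncio.Protocol):
    """Event-oriented handler: the loop calls us, we never block."""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Invoked by the single-threaded event loop when bytes arrive.
        self.transport.write(data)  # echo the bytes back

async def main():
    loop = asyncio.get_running_loop()
    server = await loop.create_server(EchoProtocol, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # Act as our own client to exercise the server.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"ping")
    await writer.drain()
    reply = await reader.readexactly(4)
    writer.close()
    await writer.wait_closed()
    server.close()
    await server.wait_closed()
    return reply

if __name__ == "__main__":
    print(asyncio.run(main()))  # b'ping'
```

However many clients connect, all callbacks run on the one loop thread, which is exactly the single-core limitation the slide calls out.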
  12. Parallel Python

  13. Coarse-grained versus fine-grained parallelism • Coarse-grained (task parallelism) o Batch

    processing daily files o Data mining distinct segments/chunks/partitions o Process A runs on X data set, independent to process B running on Y data set • Fine-grained (data parallelism) o Map/reduce, divide & conquer, aggregation etc o Common theme: sequential execution, fan out to parallel against shared data set, collapse back down to sequential
  14. Coarse-grained versus fine-grained parallelism • Coarse-grained (multiple processes): o Typically

    adequate: using multiple processes that don’t need to talk to each other (or if they do, don’t need to talk often) o Depending on shared state, could still benefit if implemented via threads instead of processes • Better cache usage, less duplication of identical memory structures, less overhead overall • Fine-grained (multiple threads): o Typically optimal: using multiple threads within the same address space o IPC overhead can severely impact net performance when having to use processes instead of threads
  15. Python landscape for fine-grained parallelism • Python’s GIL (global interpreter

    lock) prevents more than one Python interpreter thread running at a given time • If you want to use multiple threads within the same Python process, you have to come up with a way to avoid the GIL o (Fine-grained parallelism =~ multithreading) • Today, this relies on: o Extension modules or libraries o Bypassing CPython interpreter entirely and compiling to machine code
  16. Python landscape for fine-grained parallelism • Options today: o Extension

    modules or libraries: • Accelerate/NumbaPro (GPU, Multicore) • OpenCV • Intel MKL Libraries o Bypassing the CPython interpreter entirely by compiling Python to machine code: • Numba with threading • Cython with OpenMP • Options tomorrow (Python 4.x): o PyParallel? • Demonstrates it is possible to have multiple threads running CPython interpreter threads in parallel without incurring a performance overhead o PyPy-STM?
  17. Python Landscape for Coarse-grained Parallelism • Rich ecosystem depending on

    your problem: o https://wiki.python.org/moin/ParallelProcessing o batchlib, Celery, Deap, disco, dispy, DistributedPython, exec_proxy, execnet, IPython Parallel, jug, mpi4py, PaPy, pyMPI, pypar, pypvm, Pyro, rthread, SCOOP, seppo, superspy • Python stdlib options: o multiprocessing (since 2.6) o concurrent.futures (introduced in 3.2, backported to 2.7) • Common throughout: o Separate Python processes to achieve parallel execution
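As a sketch of the stdlib route mentioned above, coarse-grained fan-out with concurrent.futures (the simulate workload is a made-up stand-in for an independent batch job):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def simulate(n):
    # Stand-in for a coarse-grained, independent job (e.g. one daily file).
    return n, sum(i * i for i in range(n))

def run(jobs=(100, 200, 300)):
    results = {}
    # Each future runs in a separate Python process, the common theme
    # across all the coarse-grained options listed on the slide.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(simulate, n) for n in jobs]
        for fut in as_completed(futures):
            n, total = fut.result()
            results[n] = total
    return results

if __name__ == "__main__":
    print(run())
```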
  18. Python & the GIL • No talk on parallelism and

    concurrency in Python would be complete without mentioning the GIL (global interpreter lock) • What is it? o A lock that ensures only one thread can execute CPython innards at any given time o Create 100 threading.Thread() instances… o ....and only one will run at any given time • So why even support threads if they can’t run in parallel? • Because they can be useful for blocking, I/O-bound problems o Ironically, they facilitate concurrency in Python, not parallelism • But they won’t solve your compute-bound problem any faster • Nor will you ever exploit more than one core
  19. Exploiting multiple cores for compute-bound problems…

  20. import multiprocessing • Added in Python 2.6 (2008) • Similar

    interface to threading module • Uses separate Python processes behind the scenes
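A minimal sketch of the module in use, showing both the threading-like Process interface and the more common Pool (the square function is illustrative):

```python
from multiprocessing import Process, Pool

def square(n):
    return n * n

def run():
    # Process mirrors the threading.Thread interface...
    p = Process(target=square, args=(3,))
    p.start()
    p.join()
    # ...but a Pool is the usual way to fan work out across cores;
    # behind the scenes each worker is a separate Python process.
    with Pool(processes=4) as pool:
        return pool.map(square, range(8))

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9, 16, 25, 36, 49]
```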
  21. from multiprocessing import pros • It works • It’s in

    the stdlib • It’s adequate for coarse-grained parallelism • It’ll use all my cores if I’m compute-bound
  22. from multiprocessing import cons • Often sub-optimal depending on the

    problem • Inadequate for fine-grained parallelism • Inadequate for I/O-driven problems (specifically socket servers) • Overhead of extra processes • No shared memory out of the box (I’d have to set it up myself) • Kinda’ quirky on Windows • The examples in the docs are trivialized and don’t really map to real world problems o https://docs.python.org/2/library/multiprocessing.html o i.e. x*x for x in [1, 2, 3, 4]
  23. from multiprocessing import subtleties • Recap: contemporary data center hardware:

    128 cores, 512GB RAM • I want to use multiprocessing to solve my compute-bound problem • And I want to optimally use my hardware; idle cores are useless • So how big should my multiprocessing pool be? How many processes? • 128, right?
  24. 128 cores = 128 processes? • Works fine… until you

    need to do I/O • And you’re probably going to be doing blocking I/O o i.e. synchronous read/write calls o Non-blocking I/O is poorly suited to multiprocessing as you’d need to have per-process event-loops doing the syscall multiplexing dance • The problem is, as soon as you block, that’s one less process able to do useful work • Can quickly become pathological: o Start a pool of 64 processes (for 64 cores) o Few minutes later: only 20-25 active • Is the solution to create a bigger pool?
  25. 128 cores = 132 processes? 194? 256? • Simply increasing

    the number of processes isn’t the solution • Results in pathological behavior on the opposite end of the spectrum o Instead of idle cores, you have over-scheduled cores o Significant overhead incurred by context switching • Cache pollution, TLB contention o You can visibly see this with basic tools like top: 20% user, 80% sys • Neither approach is optimal today: o processes <= ncpu: idle cores o processes > ncpu: over-scheduled cores
  26. What do we really need? • We want to solve

    our problems optimally on our powerful hardware • Avoid the sub-optimal: o Idleness o Blocking I/O o Context switching o Wasteful memory use • Encourage the optimal: o One active thread per core o Efficient memory use
  27. One active thread per core • This is a subtly

    complex problem • Intrinsically dependent upon I/O facilities provided by the OS: o Readiness-oriented or completion-oriented? o Thread-agnostic I/O or thread-specific I/O? • Plus one critical element: o Disassociating the work (computation) from the worker (thread) o Associating a desired concurrency level (i.e. use all my cores) with the work • This allows the kernel to make intelligent thread dispatching decisions o Ensures only one active thread per core o No over-scheduling or unnecessary context switches • Turns out Windows is really good at all this
  28. I/O Completion Ports • IOCPs can be thought of as FIFO queues •

    I/O manager pushes completion packets asynchronously • Threads pop completions off and process results, each thread running: do { s = GQCS(i); process(s); } while (1); • GQCS = GetQueuedCompletionStatus() • [Diagram: NIC → IRP → I/O Manager → IOCP → Completion Packet → four worker threads]
  29. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • [Diagram: four do { s = GQCS(i); process(s); } while (1); worker threads against an IOCP with concurrency=2]
  30. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue • [Diagram: four GQCS worker threads against an IOCP with concurrency=2]
  31. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue • ....and schedules another thread to run • [Diagram: four GQCS worker threads against an IOCP with concurrency=2]
  32. Windows and PyParallel • The Windows concurrency and synchronization primitives

    and approach to asynchronous I/O are very well suited to what I wanted to do with PyParallel • Vista introduced new thread pool APIs • Tightly integrated into the IOCP/overlapped ecosystem • Greatly reduces the amount of scaffolding code I needed to write to prototype the concept: void PxSocketClient_Callback(); CreateThreadpoolIo(.., &PxSocketClient_Callback); ... StartThreadpoolIo(..); AcceptEx(..)/WSASend(..)/WSARecv(..) • That’s it. When the async I/O op completes, your callback gets invoked • Windows manages everything: optimal thread pool size, NUMA-cognizant dispatching • Didn’t need to create a single thread, no mutexes, none of the normal headaches that come with multithreading
  33. Post-PyParallel • I now have the Python glue to optimally

    exploit my hardware • But it’s still Python… • ….and Python can be kinda’ slow • Especially when doing computationally-intensive work o Especially especially when doing numerically-oriented computation • Enter… Numba!
  34. Numba & JIT’ing via @decorators
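A minimal sketch of the @jit decorator approach; illustrative only, with a no-op fallback decorator if Numba isn't installed, so nothing here is specific to the NumbaPro builds shown on the slides:

```python
try:
    from numba import jit  # JIT-compiles the decorated function to machine code
except ImportError:
    # Hedge: fall back to a no-op decorator so the sketch runs without Numba.
    def jit(**kwargs):
        return lambda f: f

@jit(nopython=True)
def dot(xs, ys):
    # The kind of tight numeric loop Numba accelerates: plain Python
    # here, compiled to machine code (no interpreter, no GIL churn)
    # when Numba is present.
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total

if __name__ == "__main__":
    print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

The decorated function is still called like ordinary Python; compilation happens on first call.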

  35. Image Filtering

  36. Image Filtering: ~1500x Speedup

  37. NumbaPro GPU Support

  38. Black Scholes Speedup NumbaPro

  39. Final Thoughts…

  40. What do I want in Python 4.x? • Native cross-platform

    PyParallel support • @jit hooks introduced in the stdlib • ….and an API for multiple downstream jitters to hook into o CPython broadcasts the AST/bytecode being executed in ceval to jitters o Multiple jitters running in separate threads o CPython: “Hey, can you optimize this chunk of Python? Let me know.” o Next time it encounters that chunk, it can check for optimal versions • Could provide a viable way of hooking in Numba, PyPy, Pythran, ShedSkin, etc, whilst still staying within the confines of CPython
  41. Thanks! @ContinuumIO, @trentnelson trent.nelson@continuum.io http://speakerdeck.com/trent/

  42. (Backup slides)

  43. I/O on Contemporary Windows Kernels (Vista+) • Fantastic support for

    asynchronous I/O • Threads have been first class citizens since day 1 (not bolted on as an afterthought) • Designed to be programmed in a completion-oriented, multi-threaded fashion • Overlapped I/O + IOCP + threads + kernel synchronization primitives = excellent combo for achieving high performance
  44. I/O Completion Ports • The best way to grok IOCP

    is to understand the problem it was designed to solve: o Facilitate writing high-performance network/file servers (http, database, file server) o Extract maximum performance from multi-processor/multi-core hardware o (Which necessitates optimal resource usage)
  45. IOCP: Goals • Extract maximum performance through parallelism o Thread

    running on every core servicing a client request o Upon finishing a client request, immediately processes the next request if one is waiting o Never block o (And if you do block, handle it as optimally as possible) • Optimal resource usage o One active thread per core
  46. On not blocking... • UNIX approach: o Set file descriptor

    to non-blocking o Try to read or write data o Get EAGAIN instead of blocking o Try again later • Windows approach o Create an overlapped I/O structure o Issue a read or write, passing the overlapped structure and completion port info o Call returns immediately o Read/write done asynchronously by I/O manager o Optional completion packet queued to the completion port a) on error, b) on completion o Thread waiting on completion port de-queues completion packet and processes request
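The UNIX readiness pattern above fits in a few lines of Python, where EAGAIN surfaces as BlockingIOError (the socketpair is just a convenient stand-in for a network socket):

```python
import socket

# Set the fd non-blocking, try the call, and handle "would block"
# (EAGAIN, raised in Python as BlockingIOError) by trying again later.
a, b = socket.socketpair()
a.setblocking(False)

try:
    a.recv(1024)            # nothing has been sent yet
except BlockingIOError:
    print("EAGAIN: nothing ready, try again later")

b.sendall(b"hello")         # now there is data in the kernel buffer
print(a.recv(1024))         # b'hello'
a.close()
b.close()
```

In a real server, the "try again later" step is what select/poll/epoll/kqueue multiplexes across thousands of descriptors.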
  47. On not blocking... • UNIX approach (readiness-oriented, reactor

    pattern): o Is this ready to write yet? o No? How about now? o Still no? o Now? o Yes!? Really? Ok, write it! o Hi! Me again. Anything to read? o No? o How about now? • Windows approach (completion-oriented, proactor pattern): o Here, do this. Let me know when it’s done.
  48. On not blocking... • Windows provides an asynchronous/overlapped way to

    do just about everything • Basically, if it could block, there’s a way to do it asynchronously in Windows • WSASend()/WSARecv() vs send()/recv() • AcceptEx() vs accept() • ConnectEx() vs connect() • DisconnectEx() vs close() • GetAddrInfoEx() vs getaddrinfo() (Windows 8+) • (And that’s just for sockets; all device I/O can be done asynchronously)
  49. Thread-agnostic I/O with IOCP • Secret sauce behind asynchronous I/O

    on Windows • IOCPs allow IRP completion (copying data from nonpaged kernel memory back to user’s buffer) to be deferred to a thread-agnostic queue • Any thread can wait on this queue (completion port) via GetQueuedCompletionStatus() • IRP completion done just before that call returns • Allows I/O manager to rapidly queue IRP completions • ....and waiting threads to instantly dequeue and process
  50. IOCP and Concurrency • IOCPs can be thought of as FIFO queues •

    I/O manager pushes completion packets asynchronously • Threads pop completions off and process results, each thread running: do { s = GQCS(i); process(s); } while (1); • GQCS = GetQueuedCompletionStatus() • [Diagram: NIC → IRP → I/O Manager → IOCP → Completion Packet → four worker threads]
  51. IOCP and Concurrency • Remember IOCP design goals: o Maximize

    performance o Optimize resource usage • Optimal number of active threads running per core: 1 • Optimal number of total threads running: 1 * ncpu • Windows can’t control how many threads you create and then have waiting against the completion port • But it can control when and how many threads get awoken • ….via the IOCP’s maximum concurrency value • (Specified when you create the IOCP)
  52. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • [Diagram: four do { s = GQCS(i); process(s); } while (1); worker threads against an IOCP with concurrency=2]
  53. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue • [Diagram: four GQCS worker threads against an IOCP with concurrency=2]
  54. IOCP and Concurrency • Set I/O completion port’s concurrency to

    number of CPUs/cores (2) • Create double the number of threads (4) • An active thread does something that blocks (e.g. file I/O) • Windows can detect that the active thread count (1) has dropped below max concurrency (2) and that there are still outstanding packets in the completion queue • ....and schedules another thread to run • [Diagram: four GQCS worker threads against an IOCP with concurrency=2]
  55. Removing the GIL (Without needing to remove the GIL.)

  56. So how does it work? • First, how it doesn’t

    work: o No GIL removal • This was previously tried and rejected • Required fine-grained locking throughout the interpreter • Mutexes are expensive • Single-threaded execution significantly slower o Not using PyPy’s approach via Software Transactional Memory (STM) • Huge overhead • 64 threads trying to write to something, 1 wins, continues • 63 keep trying • 63 bottles of beer on the wall… • Doesn’t support “free threading” o Existing code using threading.Thread won’t magically run on all cores o You need to use the new async APIs
  57. PyParallel’s Approach • Don’t touch the GIL o It’s great,

    serves a very useful purpose • Instead, intercept all thread-sensitive calls: o Reference counting (Py_(INCREF|DECREF|CLEAR)) o Memory management (PyMem_(Malloc|Free), PyObject_(INIT|NEW)) o Free lists o Static C globals o Interned strings • If we’re the main thread, do what we normally do • However, if we’re a parallel thread, do a thread-safe alternative
  58. Main thread or Parallel Thread? • “If we’re a parallel

    thread, do X, if not, do Y” o X = thread-safe alternative o Y = what we normally do • “If we’re a parallel thread” o Thread-sensitive calls are ubiquitous o But we want to have a negligible performance impact o So the challenge is how quickly can we detect if we’re a parallel thread o The quicker we can detect it, the less overhead incurred
  59. The Py_PXCTX macro “Are we running in a parallel context?”

    #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id()) • What’s so special about _Py_get_current_thread_id()? o On Windows, you could use GetCurrentThreadId() o On POSIX, pthread_self() • Unnecessary overhead (this macro will be everywhere) • Is there a quicker way? • Can we determine if we’re running in a parallel context without needing a function call?
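The test Py_PXCTX performs, "is the current thread the main thread?", can be sketched in pure Python (names like in_parallel_context are illustrative; PyParallel does this in C with a register read precisely to avoid the function call that threading.get_ident() costs):

```python
import threading

# Captured once at startup, analogous to Py_MainThreadId.
MAIN_THREAD_ID = threading.get_ident()

def in_parallel_context():
    # Py_PXCTX analogue: true only when the caller is NOT the main thread.
    return threading.get_ident() != MAIN_THREAD_ID

def run():
    results = {"main": in_parallel_context()}

    def worker():
        results["worker"] = in_parallel_context()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return results

if __name__ == "__main__":
    print(run())  # {'main': False, 'worker': True}
```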
  60. Windows Solution: Interrogate the TEB

    #ifdef WITH_INTRINSICS
    #  ifdef MS_WINDOWS
    #    include <intrin.h>
    #    if defined(MS_WIN64)
    #      pragma intrinsic(__readgsdword)
    #      define _Py_get_current_process_id() (__readgsdword(0x40))
    #      define _Py_get_current_thread_id()  (__readgsdword(0x48))
    #    elif defined(MS_WIN32)
    #      pragma intrinsic(__readfsdword)
    #      define _Py_get_current_process_id() __readfsdword(0x20)
    #      define _Py_get_current_thread_id()  __readfsdword(0x24)
  61. Py_PXCTX Example

    -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
    +#define _Py_ForgetReference(op) \
    +    do { \
    +        if (Py_PXCTX) \
    +            _Px_ForgetReference(op); \
    +        else \
    +            _Py_INC_TPFREES(op); \
    +    } while (0)
    +
    +#endif /* WITH_PARALLEL */
    • Py_PXCTX == (Py_MainThreadId != __readgsdword(0x48)) on x64 (or __readfsdword(0x24) on x86) • Overhead reduced to a couple more instructions and an extra branch (the cost of which can be eliminated by branch prediction) • That’s basically free compared to STM or fine-grained locking
  62. PyParallel Advantages • Initial profiling results: 0.01% overhead incurred by

    Py_PXCTX for normal single-threaded code o GIL removal: 40% overhead o PyPy’s STM: “200-500% slower” • Only touches a relatively small amount of code o No need for intrusive surgery like re-writing a thread-safe bucket memory allocator or garbage collector • Keeps GIL semantics o Important for legacy code o 3rd party libraries, C extension code • Code executing in parallel context has full visibility to “main thread objects” (in a read-only capacity, thus no need for locks)
  63. PyParallel In Action • Things to note with the chargen

    demo coming up: o One python_d.exe process o Constant memory use o CPU use proportional to concurrent client count (1 client = 25% CPU use) o Every 10,000 sends, a status message is printed • Depicts dynamically switching from synchronous sends to async sends • Illustrates awareness of active I/O hogs • Environment: o Macbook Pro, 8 core i7 2.2GHz, 8GB RAM o 1-5 netcat instances on OS X o Windows 7 instance running in Parallels, 4 cores, 3GB
  64. 1 Chargen (Num. Processes / CPU% / Mem% = 99 / 25% / 67%)

  65. 2 Chargen (99/54%/67%)

  66. 3 Chargen (99/77%/67%)

  67. 4 Chargen (99/99%/68%)

  68. 5 Chargen?! (99/99%/67%)

  69. Thanks! Follow us on Twitter for more PyParallel announcements! @ContinuumIO

    @trentnelson http://continuum.io/