
Parallelizing the Python Interpreter: An Alternate Approach to Async

This is the short overview presentation I gave to Python developers at the PyCon 2013 Language Summit.

Trent Nelson

March 14, 2013

Transcript

1. Background/Motivation

• Two parts:
  • “Parallelizing” the interpreter
    • Allowing multiple threads to run CPython internals concurrently
  • True “asynchronicity”
    • Leverage the parallel facilities above to expose a “truly asynchronous” async API to Python code
    • (Allow Python code to run concurrently within the same interpreter)
• I’ve had a vague idea for how to go about the parallel aspect for a while
• The async discussions on python-ideas last year provided the motivation to try to tie both things together
• Also figured it would be a great pet project to better familiarize myself with CPython internals

2. Details

• Have been working on it since late December-ish (full time until mid-February-ish)
• It all lives in hg.python.org/sandbox/trent, px branch:
  • Include/object.h
  • Include/pyparallel.h
  • Python/pyparallel.c & pyparallel_private.h (vast majority of the code)
• The code quality is… ultra proof-of-concept/prototype/hackfest
  • Think of how a homeless hoarder would treat his shopping cart: everything gets added, nothing gets removed
  • …so if you’re hoping to quickly grok how it all works by reviewing the code, you’re going to have a bad time
• Only works on Vista+ (for now)
  • Aim is to get everything working first, then refactor/review, review support for other platforms, etc.

3. How it’s exposed to Python: The Async Façade

• Submission of arbitrary “work”:

      async.submit_work(func, args, kwds, callback=None, errback=None)

  • Calls func(args, kwds) from a parallel thread
• Submission of timers:

      async.submit_timer(time, func, args, kwds, …)
      async.submit_timer(interval, func, args, kwds, …)

  • Calls func(args, kwds) from a parallel thread some ‘time’ in the future, or every ‘interval’
• Submission of “waits”:

      async.submit_wait(obj, func, args, kwds, …)

  • Calls func(args, kwds) from a parallel thread when ‘obj’ is signalled

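For a feel of the façade, here is a minimal sketch of submitting work and a timer. It assumes the px branch’s async module is importable; the callback/errback signatures shown are assumptions on my part, as the slides don’t spell them out:

    import async   # PyParallel's experimental module (px branch), not stdlib

    def work(args, kwds):
        # invoked as func(args, kwds) from a parallel thread
        return sum(range(10**6))

    def on_done(result):        # assumed signature
        print('work finished:', result)

    def on_error(error):        # assumed signature
        print('work failed:', error)

    async.submit_work(work, None, None, callback=on_done, errback=on_error)

    # one-shot timer: run work() roughly five seconds from now
    async.submit_timer(5.0, work, None, None)

    async.run()   # drive the loop, as in the server demos that follow
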
4. The Async Façade (cont.)

• Asynchronous file I/O:

      f = async.open(filename, 'wb')
      async.write(f, buf, callback=…, errback=…)

• Asynchronous client/server services:

      client = async.client(host, port)
      server = async.server(host, port)
      async.register(transport=client, protocol=…)
      async.register(transport=server, protocol=…)
      async.run()

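A concrete (hypothetical) instance of the file I/O calls above, with the elided callbacks filled in; the callback signatures are again assumptions:

    import async

    def on_write(*args):    # exact signature not shown on the slide
        print('buffer flushed to disk')

    def on_error(*args):    # exact signature not shown on the slide
        print('write failed')

    f = async.open('out.bin', 'wb')
    async.write(f, b'x' * 4096, callback=on_write, errback=on_error)
    async.run()
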

5. Chargen: Character Generator

    def chargen(lineno, nchars=72):
        start = ord(' ')
        end = ord('~')
        c = lineno + start
        while c > end:
            c = (c % end) + start
        b = bytearray(nchars)
        for i in range(0, nchars-2):
            if c > end:
                c = start
            b[i] = c
            c += 1
        b[nchars-1] = ord('\n')
        return b

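For reference, each call produces one newline-terminated line of printable ASCII whose starting character rotates with lineno (values below computed from the function above):

    >>> chargen(0)[:10]
    bytearray(b' !"#$%&\'()')
    >>> chargen(1)[:10]
    bytearray(b'!"#$%&\'()*')
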
6. Demo: An Async Chargen Service

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

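Slide 21 mentions exercising the service with a local netcat client; a plain socket client works just as well for a quick look at the output (address and port taken from the demo above):

    import socket

    # connect to the chargen service and print a few lines;
    # recv() may return short reads, which is fine for eyeballing
    s = socket.create_connection(('10.211.55.3', 20019))
    try:
        for _ in range(5):
            print(s.recv(72).decode('ascii'), end='')
    finally:
        s.close()
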
7. Things to note with the demo…

• Constant memory use
• Only one python_d.exe process…
• …exploiting all cores
• …without any explicit “thread” usage (i.e. no threading.Thread instances)
  • (Not that threading.Thread instances would automatically exploit all cores)

8. How it works…

• No GIL removal
• No fine-grained locking
• Not using STM
• No “free threading”
  • i.e. concurrency isn’t achieved by exposing a threading.Thread-type object to Python code
• Negligible overhead to single-threaded execution

9. “Intercept thread-sensitive calls”

• Anything that the GIL is intended to protect:
  • Py_INCREF/DECREF/CLEAR (anything to do with reference counting)
  • PyMem_Malloc/Free, PyObject_INIT/NEW (anything to do with memory allocation/deallocation)
  • Free lists, globals, etc.

10. “If we’re a parallel thread, do X; if not, do Y”

• Y = do what we normally do
• X = a thread-safe, parallel-specific alternative
  • (The topic for another presentation.)

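In Python terms the pattern looks roughly like this. It is a loose analogue only: the real check is a C macro (slides 12–15), and _do_x/_do_y here are hypothetical stand-ins for the parallel and normal paths:

    import threading

    # cache the main thread's id once; every "thread-sensitive" call
    # then branches on a comparison against the current thread's id
    _MAIN_THREAD_ID = threading.get_ident()

    def _do_x(obj):
        # X: hypothetical thread-safe, parallel-specific alternative
        print('parallel path for', obj)

    def _do_y(obj):
        # Y: what we normally do
        print('normal path for', obj)

    def thread_sensitive_call(obj):
        if threading.get_ident() != _MAIN_THREAD_ID:
            _do_x(obj)
        else:
            _do_y(obj)
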
11. “If we’re a parallel thread, do X; if not, do Y”

• “Thread-sensitive” calls are ubiquitous
• The challenge then becomes how quickly you can detect whether we’re in a parallel thread
• The quicker you can detect it, the less overhead the whole approach incurs

12. Py_PXCTX

• The magic macro, from Include/pyparallel.h:

      #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())

• What’s special about _Py_get_current_thread_id()?
  • On Windows you could use GetCurrentThreadId(); on POSIX, pthread_self()
  • But that involves a syscall/libc trip
  • Is there a quicker way?

13. Windows solution: the TEB

    #ifdef WITH_INTRINSICS
    #  ifdef MS_WINDOWS
    #    include <intrin.h>
    #    if defined(MS_WIN64)
    #      pragma intrinsic(__readgsdword)
    #      define _Py_get_current_process_id() (__readgsdword(0x40))
    #      define _Py_get_current_thread_id()  (__readgsdword(0x48))
    #    elif defined(MS_WIN32)
    #      pragma intrinsic(__readfsdword)
    #      define _Py_get_current_process_id() (__readfsdword(0x20))
    #      define _Py_get_current_thread_id()  (__readfsdword(0x24))

14. The (amd64) POSIX solution

• __read[fg]sbase()
• Guaranteed to return a unique value for each thread
• (Whether or not that “unique value” is a thread id is another matter)

15. Py_PXCTX typical usage

• On Win64, Py_PXCTX expands to (Py_MainThreadId != __readgsdword(0x48))

    -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
    +#define _Py_ForgetReference(op) \
    +    do { \
    +        if (Py_PXCTX) \
    +            _Px_ForgetReference(op); \
    +        else \
    +            _Py_INC_TPFREES(op); \
    +        break; \
    +    } while (0)
    +
    +#endif /* WITH_PARALLEL */

• So overhead is reduced to a couple more instructions and an extra branch (the cost of which is minimized by branch prediction in most cases)
• That is basically nothing compared to STM or fine-grained locking

16. Demo: An Async Chargen Service

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

17. “Send Complete”: Clarification

• Called when a send() (WSASend()) call completes (either synchronously or asynchronously)
• What it doesn’t mean:
  • The other side definitely got it (they could have closed the connection)
• What it does mean:
  • All the data you tried to send successfully became bytes on a wire
  • The send socket buffer is empty
• What it implies:
  • You’re free to send more data if you’ve got it
• Why it’s useful (see the sketch below):
  • Eliminates the need for a producer/consumer relationship, i.e. pause_producing()/stop_consuming()
  • No need to buffer anything internally

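Here is a sketch of what that buys you: a protocol that streams a file one chunk per send_complete(), with no producer/consumer plumbing. It reuses the protocol hooks from the Chargen demo; the per-connection instantiation and the idea that returning nothing ends the exchange are assumptions on my part:

    import async

    class StreamFile:
        chunk_size = 8192

        def __init__(self):
            # assumed: one protocol instance per connection
            self.f = open('payload.bin', 'rb')

        def initial_bytes_to_send(self):
            return self.f.read(self.chunk_size)

        def send_complete(self, transport, send_id):
            # only called once the send buffer has drained, so there is
            # nothing to pause/resume and nothing to buffer internally
            return self.f.read(self.chunk_size) or None   # assumed: None ends it

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=StreamFile)
    async.run()
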
18. Things to note: def chargen

    def chargen(lineno, nchars=72):
        start = ord(' ')
        end = ord('~')
        c = lineno + start
        while c > end:
            c = (c % end) + start
        b = bytearray(nchars)
        for i in range(0, nchars-2):
            if c > end:
                c = start
            b[i] = c
            c += 1
        b[nchars-1] = ord('\n')
        return b

• No blocking calls
• Will happily consume all CPU when called in a loop

19. Things to note: class Chargen

• No explicit send()
• “Sending” is achieved by returning a “sendable” object:
  • PyBytesObject
  • PyByteArray
  • PyUnicode
  • A callable PyObject that returns one of the above

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

20. Things to note: class Chargen

• Next “send” is initiated from send_complete()…
  • …which generates another send_complete()
  • …and so on
• It’s an “IO hog”: it always wants to send something when given the chance

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

21. Why chargen is such a good demo

• You’re only sending 73 bytes at a time
• The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes)
  • A good simulation of real-world conditions, where the CPU time to process a client request dwarfs the IO overhead of communicating the result back to the client
• With a default send socket buffer size of 8192 bytes and a local netcat client, you’re never going to block during send()
• Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)

22. Summary

• I suuuuuuucked at CPython internals when I started
  • “Why do we bother INCREF/DECREFing Py_None?!”
• Probably should have picked an easier “pet project” than parallelizing the interpreter
• I’m really happy with the progress so far, though: exploiting multiple cores without impacting single-threaded performance!
• Lots of work still to do
  • The “do X” part (the thread-safe alternatives to “do Y”) is a huge topic that I’m planning to tackle in subsequent presentations
  • How the async client/server stuff is implemented is a huge separate topic, too

23. Pre-emptive Answers to your Questions

• I’ve encountered the problem, spent weeks debugging it, and come up with a solid solution
• I’ve got a temporary solution in place and a better long-term solution in mind
• I’ve thought about the problem; it can probably be exploited to instantly crash the interpreter, and I’m currently dealing with it by not doing that
  • “Doctor, it hurts when I do this.”
• I have no idea, and the very problem you speak of could threaten the viability of the entire approach