Slide 1

PARALLELIZING THE PYTHON INTERPRETER
An alternate approach to async
PyCon 2013 Language Summit
Trent Nelson
[email protected]
@trentnelson

Slide 2

Background/Motivation
•  Two parts:
   •  “Parallelizing” the interpreter
      •  Allowing multiple threads to run CPython internals concurrently
   •  True “asynchronicity”
      •  Leverage the parallel facilities above to expose a “truly asynchronous” async API to Python code
      •  (Allow Python code to run concurrently within the same interpreter)
•  I’ve had a vague idea for how to go about the parallel aspect for a while
•  The async discussions on python-ideas last year provided the motivation to try and tie both things together
•  Also figured it would be a great pet project to better familiarize myself with CPython internals

Slide 3

Details
•  Have been working on it since late-December-ish (full time until mid-Feb-ish)
•  It all lives in hg.python.org/sandbox/trent, px branch
   •  Include/object.h
   •  Include/pyparallel.h
   •  Python/pyparallel.c & pyparallel_private.h
      •  Vast majority of the code
•  The code quality is… ultra proof-of-concept/prototype/hackfest
   •  Think of how a homeless hoarder would treat his shopping cart
      •  Everything gets added, nothing gets removed
   •  …so if you’re hoping to quickly grok how it all works by reviewing the code… you’re going to have a bad time.
•  Only works on Vista+ (for now)
   •  Aim is to get everything working first, then refactor/review, add support for other platforms, etc.

Slide 4

How it’s exposed to Python: The Async Façade
•  Submission of arbitrary “work”:
       async.submit_work(func, args, kwds, callback=None, errback=None)
   •  Calls func(args, kwds) from a parallel thread*
•  Submission of timers:
       async.submit_timer(time, func, args, kwds, …)
       async.submit_timer(interval, func, args, kwds, …)
   •  Calls func(args, kwds) from a parallel thread some ‘time’ in the future, or every ‘interval’.
•  Submission of “waits”:
       async.submit_wait(obj, func, args, kwds, …)
   •  Calls func(args, kwds) from a parallel thread when ‘obj’ is signalled
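
A minimal usage sketch of the three submission calls above, assuming the signatures shown on this slide; the worker/callback functions and the exact way args/kwds are passed through are illustrative assumptions, not confirmed API details:

    import async

    def work(args, kwds):
        return sum(args)                   # runs in a parallel thread

    def done(result):
        print('work finished:', result)    # hypothetical completion callback

    def failed(error):
        print('work failed:', error)       # hypothetical error callback

    # One-shot work item:
    async.submit_work(work, (1, 2, 3), {}, callback=done, errback=failed)

    # Fire once, five seconds from now:
    async.submit_timer(5.0, work, (1, 2, 3), {})

    async.run()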

Slide 5

The Async Façade (cont.)
•  Asynchronous file I/O:
       f = async.open(filename, 'wb')
       async.write(f, buf, callback=…, errback=…)
•  Asynchronous client/server services:
       client = async.client(host, port)
       server = async.server(host, port)
       async.register(transport=client, protocol=…)
       async.register(transport=server, protocol=…)
       async.run()
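
A hedged sketch of the file-I/O half, assuming async.open/async.write accept the arguments shown above; the callback signatures and the filename are illustrative assumptions:

    import async

    def on_written(f):                     # hypothetical completion callback
        print('write completed')

    def on_error(error):                   # hypothetical error callback
        print('write failed:', error)

    f = async.open('output.bin', 'wb')
    async.write(f, b'hello, world\n', callback=on_written, errback=on_error)
    async.run()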


Slide 6

A QUICK EXAMPLE
Chargen: The Character Generator

Slide 7

Chargen: Character Generator

    def chargen(lineno, nchars=72):
        start = ord(' ')
        end = ord('~')
        c = lineno + start
        while c > end:
            c = (c % end) + start
        b = bytearray(nchars)
        for i in range(0, nchars-2):
            if c > end:
                c = start
            b[i] = c
            c += 1
        b[nchars-1] = ord('\n')
        return b
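
A quick sanity check of the output, runnable against the definition above: each call yields one line of the rotating printable-ASCII pattern, offset by the line number.

    # Print the first three chargen lines; each is a rotating window
    # over the printable ASCII range, terminated by a newline.
    for lineno in range(3):
        print(bytes(chargen(lineno)))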

Slide 8

Demo: An Async Chargen Service

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

Slide 9

1 Chargen (99/25%/67%)

Slide 10

2 Chargen (99/54%/67%)

Slide 11

3 Chargen (99/77%/67%)

Slide 12

4 Chargen (99/99%/68%)

Slide 13

5 Chargen?! (99/99%/67%)

Slide 14

Things to note with the demo…
•  Constant memory use
•  Only one python_d.exe process…
•  …exploiting all cores
•  …without any explicit “thread” usage (i.e. no threading.Thread instances)
   •  (Not that threading.Thread instances would automatically exploit all cores)

Slide 15

HOW IT WORKS

Slide 16

How it works…
•  No GIL removal
•  No fine-grained locking
•  Not using STM
•  No “free threading”
   •  i.e. concurrency isn’t achieved by exposing a threading.Thread-type object to Python code
•  Negligible overhead to single-threaded execution

Slide 17

The Premise
•  Intercept all thread-sensitive calls
•  If we’re a parallel thread, do X; if not, do Y.
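
The real interception happens in C macros (shown on later slides), but the shape of the idea can be sketched in Python; every name here is an illustrative stand-in, not the actual implementation:

    import threading

    MAIN_THREAD_ID = threading.get_ident()      # recorded at startup

    def is_parallel_context():
        # Stand-in for Py_PXCTX: are we on any thread other than main?
        return threading.get_ident() != MAIN_THREAD_ID

    def px_forget_reference(obj):
        pass    # hypothetical thread-safe, parallel-only alternative ("do X")

    def py_forget_reference(obj):
        pass    # the normal single-threaded path ("do Y")

    def forget_reference(obj):
        # Every thread-sensitive call dispatches on the same cheap test:
        if is_parallel_context():
            px_forget_reference(obj)
        else:
            py_forget_reference(obj)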

Slide 18

“Intercept thread-sensitive calls”
•  Anything that the GIL is intended to protect:
   •  Py_INCREF/DECREF/CLEAR
      •  (Anything to do with reference counting.)
   •  PyMem_Malloc/Free, PyObject_INIT/NEW
      •  (Anything to do with memory allocation/deallocation.)
   •  Free lists, globals, etc.

Slide 19

“If we’re a parallel thread, do X, if not, do Y”
•  Y = do what we normally do.
•  X = a thread-safe, parallel-specific alternative
   •  (The topic for another presentation.)

Slide 20

“If we’re a parallel thread, do X, if not, do Y”
•  “Thread-sensitive” calls are ubiquitous
•  The challenge then becomes how quickly you can detect if we’re a parallel thread
•  The quicker you can detect it, the less overhead incurred by the whole approach

Slide 21

Py_PXCTX
•  The magic macro:
       #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())
•  What’s special about _Py_get_current_thread_id()?
   •  On Windows you could use GetCurrentThreadId()
   •  On POSIX, pthread_self()
   •  But that involves a syscall/libc trip
   •  Is there a quicker way?
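
Why speed matters here can be illustrated at the Python level with a stdlib analogue of that check; this is a timing toy, not the real mechanism (the actual check happens in C, on every reference-count operation):

    import timeit
    import threading

    MAIN_THREAD_ID = threading.get_ident()

    def px_ctx():
        # Python-level analogue of Py_PXCTX: one interpreter/libc
        # round-trip per check.
        return threading.get_ident() != MAIN_THREAD_ID

    # This test would run on every Py_INCREF/DECREF, so even a few
    # nanoseconds per call add up; hence the hunt for a faster path.
    print(timeit.timeit(px_ctx, number=1000000))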

Slide 22

Windows solution: the TEB

    #ifdef WITH_INTRINSICS
    # ifdef MS_WINDOWS
    #  include <intrin.h>
    #  if defined(MS_WIN64)
    #   pragma intrinsic(__readgsdword)
    #   define _Py_get_current_process_id() (__readgsdword(0x40))
    #   define _Py_get_current_thread_id()  (__readgsdword(0x48))
    #  elif defined(MS_WIN32)
    #   pragma intrinsic(__readfsdword)
    #   define _Py_get_current_process_id() (__readfsdword(0x20))
    #   define _Py_get_current_thread_id()  (__readfsdword(0x24))
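
A Windows-only sanity check of what those TEB reads return, from Python via ctypes; that threading.get_ident() maps to GetCurrentThreadId() on Windows is my assumption here, not something stated on the slide:

    # Windows-only sketch: compare the documented thread-id API against
    # CPython's own notion of the current thread id.
    import ctypes
    import threading

    kernel32 = ctypes.WinDLL('kernel32')

    # GetCurrentThreadId() returns the same DWORD the TEB holds at
    # gs:[0x48] (x64) / fs:[0x24] (x86).
    assert kernel32.GetCurrentThreadId() == threading.get_ident()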

Slide 23

The (amd64) POSIX solution
•  __read[fg]sbase()
•  Guaranteed to return a unique value for each thread
•  (Whether or not that “unique value” is a thread id is another matter)

Slide 24

Py_PXCTX typical usage
•  On Win64, Py_PXCTX expands to (Py_MainThreadId != __readgsdword(0x48))

    -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
    +#define _Py_ForgetReference(op)         \
    +    do {                                \
    +        if (Py_PXCTX)                   \
    +            _Px_ForgetReference(op);    \
    +        else                            \
    +            _Py_INC_TPFREES(op);        \
    +        break;                          \
    +    } while (0)
    +
    +#endif /* WITH_PARALLEL */

•  So, overhead is reduced to a couple more instructions and an extra branch (the cost of which is minimized by branch prediction in most cases)
•  That is basically nothing compared to STM or fine-grained locking

Slide 25

BACK TO CHARGEN

Slide 26

Demo: An Async Chargen Service

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

Slide 27

5 Chargen?! (99/99%/67%)

Slide 28

“Send Complete”: Clarification
•  Called when a send() (WSASend()) call completes (either synchronously or asynchronously)
•  What it doesn’t mean:
   •  The other side definitely got it (they could have closed the connection)
•  What it does mean:
   •  All the data you tried to send successfully became bytes on a wire
   •  Send socket buffer is empty
•  What it implies:
   •  You’re free to send more data if you’ve got it (see the sketch below)
•  Why it’s useful:
   •  Eliminates the need for a producer/consumer relationship
      •  i.e. pause_producing()/stop_consuming()
   •  No need to buffer anything internally
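
A hedged sketch of driving a bounded transfer entirely from send_complete(), using the protocol methods from the Chargen slides; the 100-send cutoff and the return-nothing-to-stop convention are illustrative assumptions:

    import async

    class FiniteChargen:
        """Sends 100 chargen lines, one per completed send, then stops."""

        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            # Each completion triggers the next send: no internal buffer,
            # no pause_producing()/stop_consuming() machinery.
            if send_id < 100:
                return chargen(send_id)
            # Returning nothing: assumed to mean "nothing more to send".

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=FiniteChargen)
    async.run()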

Slide 29

Things to note: def chargen

    def chargen(lineno, nchars=72):
        start = ord(' ')
        end = ord('~')
        c = lineno + start
        while c > end:
            c = (c % end) + start
        b = bytearray(nchars)
        for i in range(0, nchars-2):
            if c > end:
                c = start
            b[i] = c
            c += 1
        b[nchars-1] = ord('\n')
        return b

•  No blocking calls
•  Will happily consume all CPU when called in a loop

Slide 30

Things to note: class Chargen
•  No explicit send()
•  “Sending” is achieved by returning a “sendable” object:
   •  PyBytesObject
   •  PyByteArray
   •  PyUnicode
   •  Callable PyObject that returns one of the above (see the sketch below)

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()
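
A minimal sketch of the callable case; that the framework invokes the returned callable with no arguments to obtain the payload is my reading of the bullet above, not a confirmed detail:

    class LazyChargen:
        def initial_bytes_to_send(self):
            # Return a callable instead of bytes; the framework is assumed
            # to call it (deferring generation) to get the actual payload.
            return lambda: chargen(0)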

Slide 31

Things to note: class Chargen
•  Next “send” initiated from send_complete()…
   •  …which generates another send_complete()
   •  …and so on
•  It’s an “IO hog”
   •  Always wants to send something when given the chance

    import async

    class Chargen:
        def initial_bytes_to_send(self):
            return chargen(0)

        def send_complete(self, transport, send_id):
            return chargen(send_id)

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=Chargen)
    async.run()

Slide 32

Why chargen is such a good demo
•  You’re only sending 73 bytes at a time
•  The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes)
   •  A good simulator of real-world conditions, where the CPU time to process a client request would dwarf the IO overhead of communicating the result back to the client
•  With a default send socket buffer size of 8192 bytes and a local netcat client, you’re never going to block during send()
   •  Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)

Slide 33

Summary
•  I suuuuuuucked at CPython internals when I started
   •  “Why do we bother INCREF/DECREFing Py_None?!”
•  Probably should have picked an easier “pet project” than parallelizing the interpreter
•  I’m really happy with progress so far though: exploit multiple cores without impacting single-threaded performance!
•  Lots of work still to do
   •  The “do X” part (the thread-safe alternatives to “do Y”) is a huge topic that I’m planning on tackling in subsequent presentations
   •  How the async client/server stuff is implemented is a huge separate topic too

Slide 34

Pre-emptive Answers to your Questions
•  I’ve encountered the problem, spent weeks debugging it, and come up with a solid solution
•  I’ve got a temporary solution in place and a better long-term solution in mind
•  I’ve thought about the problem, it can probably be exploited to instantly crash the interpreter, and I’m currently dealing with it by not doing that
   •  “Doctor, it hurts when I do this.”
•  I have no idea, and the very problem you speak of could threaten the viability of the entire approach

Slide 35

QUESTIONS?
[email protected]
@trentnelson