Parallelizing the Python Interpreter: An Alternate Approach to Async

PARALLELIZING THE PYTHON INTERPRETER An alternate approach to async PyCon
2013 Language Summit Trent Nelson trent@snakebite.org @trentnelson

Background/Motivation •  Two parts: •  “Parallelizing” the interpreter •  Allowing
multiple threads to run CPython internals concurrently •  True “asynchronicity” •  Leverage the parallel facilities above to expose a “truly asynchronous” async API to Python code •  (Allow Python code to run concurrently within the same interpreter) •  I’ve had a vague idea for how to go about the parallel aspect for a while •  The async discussions on python-ideas last year provided the motivation to try and tie both things together •  Also figured it would be a great pet project to better familiarize myself with CPython internals

Details •  Have been working on it since late-December-ish (full
time until mid Feb-ish) •  It all lives in hg.python.org/sandbox/trent, px branch •  Include/object.h •  Include/pyparallel.h •  Python/pyparallel.c & pyparallel_private.h •  Vast majority of code •  The code quality is… ultra proof-of-concept/prototype/ hackfest •  Think of a how a homeless hoarder would treat his shopping cart •  Everything gets added, nothing gets removed •  ….so if you’re hoping to quickly grok how it all works by reviewing the code… you’re going to have a bad time. •  Only works on Vista+ (for now) •  Aim is to get everything working first, then refactor/review, review support for other platforms etc

How it’s exposed to Python: The Async Façade •  Submission
of arbitrary “work”: async.submit_work(func, args, kwds, callback=None, errback=None) •  Calls func(args, kwds) from a parallel thread* •  Submission of timers: async.submit_timer(time, func, args, kwds, …) async.submit_timer(interval, func, args, kwds, …) •  Calls func(args, kwds) from a parallel thread some ‘time’ in the future, or every interval. •  Submission of “waits”: async.submit_wait(obj, func, args, kwds, …) •  Calls func(args, kwds) from a parallel thread when ‘obj’ is signalled

The Async Façade (cont.) •  Asynchronous file I/O f =
async.open(filename, ‘wb’) async.write(f, buf, callback=…, errback=…) •  Asynchronous client/server services client = async.client(host, port) server = async.server(host, port) async.register(transport=client, protocol=…) async.register(transport=server, protocol=…) async.run() 

A QUICK EXAMPLE Chargen: The Character Generator

Chargen: Character Generator def chargen(lineno, nchars=72): start = ord(' ')
end = ord('~') c = lineno + start while c > end: c = (c % end) + start b = bytearray(nchars) for i in range(0, nchars-2): if c > end: c = start b[i] = c c += 1 b[nchars-1] = ord('\n') return b

Demo: An Async Chargen Service import async class Chargen: def
initial_bytes_to_send(self): return chargen(0) def send_complete(self, transport, send_id): return chargen(send_id) server = async.server('10.211.55.3', 20019) async.register(transport=server, protocol=Chargen) async.run()

1 Chargen (99/25%/67%)

2 Chargen (99/54%/67%)

3 Chargen (99/77%/67%)

4 Chargen (99/99%/68%)

5 Chargen?! (99/99%/67%)

Things to note with the demo… •  Constant memory use
•  Only one python_d.exe process… •  ….exploiting all cores •  ....without any explicit “thread” usage (i.e. no threading.Thread instances) •  (Not that threading.Thread instances would automatically exploit all cores)

HOW IT WORKS

How it works… •  No GIL removal •  No fine-grained
locking •  Not using STM •  No “free threading” •  i.e. concurrency isn’t achieved by exposing a threading.Thread-type object to Python code •  Negligible overhead to single-threaded execution

The Premise • Intercept all thread-sensitive calls • If we’re a parallel
thread, do X, if not, do Y.

“Intercept thread-sensitive calls” •  Anything that the GIL is intended
to protect •  Py_INCREF/DECREF/CLEAR •  (Anything to do with reference counting.) •  PyMem_Malloc/Free, PyObject_INIT/NEW •  (Anything to do with memory allocation/deallocation.) •  Free lists, globals, etc

“If we’re a parallel thread, do X, if not, do
Y” •  Y = do what we normally do. •  X = a thread-safe, parallel-specific alternative •  (The topic for another presentation.)

“If we’re a parallel thread, do X, if not, do
Y” •  “Thread-sensitive” calls are ubiquitous •  The challenge then becomes how quickly you can detect if we’re a parallel thread •  The quicker you can detect it, the less overhead incurred by the whole approach

Py_PXCTX •  The magic macro: <Include/pyparallel.h> #define Py_PXCTX (Py_MainThreadId !=
_Py_get_current_thread_id()) •  What’s special about _Py_get_current_thread_id()? •  On Window you could use GetCurrentThreadId() •  On POSIX, pthread_self() •  But that involves a syscall/libc trip •  Is there a quicker way?

Windows solution: the TEB #ifdef WITH_INTRINSICS # ifdef MS_WINDOWS #
include <intrin.h> # if defined(MS_WIN64) # pragma intrinsic(__readgsdword) # define _Py_get_current_process_id() (__readgsdword(0x40)) # define _Py_get_current_thread_id() (__readgsdword(0x48)) # elif defined(MS_WIN32) # pragma intrinsic(__readfsdword) # define _Py_get_current_process_id() (__readfsdword(0x20)) # define _Py_get_current_thread_id() (__readfsdword(0x24))

The (amd64) POSIX solution •  __read[fg]sbase() •  Guarantees to return
a unique value for each thread •  (Whether or not that “unique value” is a thread id is another matter)

Py_PXCTX typical usage •  (Py_MainThreadId == __readfsdword(0x48)) -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
+#define _Py_ForgetReference(op) \ + do { \ + if (Py_PXCTX) \ + _Px_ForgetReference(op); \ + else \ + _Py_INC_TPFREES(op); \ + break; \ + } while (0) + +#endif /* WITH_PARALLEL */ •  So, overhead is reduced to a couple more instructions and an extra branch (cost of which minimized by branch prediction (in most cases)) •  That is basically nothing compared to STM or fine-grained locking

BACK TO CHARGEN

Demo: An Async Chargen Service import async class Chargen: def
initial_bytes_to_send(self): return chargen(0) def send_complete(self, transport, send_id): return chargen(send_id) server = async.server('10.211.55.3', 20019) async.register(transport=server, protocol=Chargen) async.run()

5 Chargen?! (99/99%/67%)

“Send Complete”: Clarification •  Called when a send() (WSASend()) call
completes (either synchronously or asynchronously) •  What it doesn’t mean: •  The other side definitely got it (they could have closed the connection) •  What it does mean: •  All the data you tried to send successfully became bytes on a wire •  Send socket buffer is empty •  What it implies: •  You’re free to send more data if you’ve got it. •  Why it’s useful: •  Eliminates the need for a producer/consumer relationship •  i.e. pause_producing()/stop_consuming() •  No need to buffer anything internally

Things to note: def chargen def chargen(lineno, nchars=72): start =
ord(' ') end = ord('~') c = lineno + start while c > end: c = (c % end) + start b = bytearray(nchars) for i in range(0, nchars-2): if c > end: c = start b[i] = c c += 1 b[nchars-1] = ord('\n') return b •  No blocking calls •  Will happily consume all CPU when called in a loop

Things to note: class Chargen •  No explicit send() • 
“Sending” is achieved by returning a “sendable” object •  PyBytesObject •  PyByteArray •  PyUnicode •  Callable PyObject that returns one of the above import async class Chargen: def initial_bytes_to_send(self): return chargen(0) def send_complete(self, transport, send_id): return chargen(send_id) server = async.server('10.211.55.3’, 20019) async.register(transport=server, protocol=Chargen) async.run()

Things to note: class Chargen •  Next “send” initiated from
send_complete() •  ….which generates another send_complete() •  ….and so on •  It’s an “IO hog” •  Always wants to send something when given the chance import async class Chargen: def initial_bytes_to_send(self): return chargen(0) def send_complete(self, transport, send_id): return chargen(send_id) server = async.server('10.211.55.3’, 20019) async.register(transport=server, protocol=Chargen) async.run()

Why chargen is such a good demo •  You’re only
sending 73 bytes at a time •  The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes) •  Good simulator of real world conditions, where the CPU time to process a client request would dwarf the IO overhead communicating the result back to the client •  With a default send socket buffer size of 8192 bytes and a local netcat client, you’re never going to block during send() •  Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)

Summary •  I suuuuuuucked at CPython internals when I started
•  “Why do we bother INCREF/DECREFing Py_None?!” •  Probably should have picked an easier “pet project” than parallelizing the interpreter •  I’m really happy with progress so far though: exploit multiple cores without impacting single-threaded performance! •  Lots of work still to do •  The “do X” part (the thread-safe alternatives to “do Y”) is a huge topic that I’m planning on tackling in subsequent presentations •  How the async client/server stuff is implemented is a huge separate topic too

Pre-emptive Answers to your Questions •  I’ve encountered the problem,
spent weeks debugging it, and come up with a solid solution •  I’ve got a temporary solution in place and a better long term solution in mind •  I’ve thought about the problem, it can probably be exploited to instantly crash the interpreter, and I’m currently dealing with it by not doing that •  “Doctor, it hurts when I do this.” •  I have no idea and the very problem you speak of could threaten the viability of the entire approach

QUESTIONS? trent@snakebite.org @trentnelson

Parallelizing the Python Interpreter: An Altern...

Parallelizing the Python Interpreter: An Alternate Approach to Async

More Decks by Trent Nelson

Other Decks in Programming

Featured

Transcript