multiple threads to run CPython internals concurrently
• True “asynchronicity”
  • Leverage the parallel facilities above to expose a “truly asynchronous” async API to Python code
  • (Allow Python code to run concurrently within the same interpreter)
• I’ve had a vague idea of how to go about the parallel aspect for a while
• The async discussions on python-ideas last year provided the motivation to try to tie both things together
• Also figured it would be a great pet project to better familiarize myself with CPython internals
time until mid Feb-ish)
• It all lives in hg.python.org/sandbox/trent, px branch
  • Include/object.h
  • Include/pyparallel.h
  • Python/pyparallel.c & pyparallel_private.h
    • Vast majority of the code
• The code quality is… ultra proof-of-concept/prototype/hackfest
  • Think of how a homeless hoarder would treat his shopping cart
  • Everything gets added, nothing gets removed
  • ….so if you’re hoping to quickly grok how it all works by reviewing the code… you’re going to have a bad time.
• Only works on Vista+ (for now)
  • Aim is to get everything working first, then refactor/review, add support for other platforms, etc.
of arbitrary “work”: async.submit_work(func, args, kwds, callback=None, errback=None)
  • Calls func(args, kwds) from a parallel thread*
• Submission of timers: async.submit_timer(time, func, args, kwds, …) / async.submit_timer(interval, func, args, kwds, …)
  • Calls func(args, kwds) from a parallel thread at some ‘time’ in the future, or every ‘interval’
• Submission of “waits”: async.submit_wait(obj, func, args, kwds, …)
  • Calls func(args, kwds) from a parallel thread when ‘obj’ is signalled
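To make the shape of this API concrete, here is a minimal sketch of how the submission calls above might be used. It assumes the func(args, kwds) calling convention described above, plus the async.run() entry point used by the chargen examples later on; the exact signatures in the px branch may differ.

import async   # the px-branch 'async' module described above

def work(args, kwds):
    # CPU-bound work; per the slide, func is invoked as func(args, kwds)
    # from a parallel thread (never the main thread).
    return sum(args)

def on_done(result):
    print('work finished:', result)

def on_error(exc):
    print('work failed:', exc)

# Submit a work item; callback/errback are optional and also run in
# parallel threads.
async.submit_work(work, (1, 2, 3), {}, callback=on_done, errback=on_error)

# Fire the same callable from a parallel thread some time in the future
# (time/interval semantics as described above).
async.submit_timer(5, work, (4, 5, 6), {})

async.run()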
# Signature and the 'start' assignment are restored from the surrounding
# slides; the nchars default of 73 is inferred from the "73 bytes at a
# time" figure mentioned later.
def chargen(lineno, nchars=73):
    start = ord(' ')
    end = ord('~')
    c = lineno + start
    while c > end:
        c = (c % end) + start
    b = bytearray(nchars)
    for i in range(0, nchars-2):
        if c > end:
            c = start
        b[i] = c
        c += 1
    b[nchars-1] = ord('\n')
    return b
• Only one python_d.exe process…
  • …exploiting all cores
  • …without any explicit “thread” usage (i.e. no threading.Thread instances)
  • (Not that threading.Thread instances would automatically exploit all cores)
locking
• Not using STM
• No “free threading”
  • i.e. concurrency isn’t achieved by exposing a threading.Thread-type object to Python code
• Negligible overhead to single-threaded execution
to protect
• Py_INCREF/DECREF/CLEAR
  • (Anything to do with reference counting.)
• PyMem_Malloc/Free, PyObject_INIT/NEW
  • (Anything to do with memory allocation/deallocation.)
• Free lists, globals, etc.
Y”
• “Thread-sensitive” calls are ubiquitous
• The challenge then becomes how quickly you can detect whether you’re in a parallel thread
  • The quicker the detection, the less overhead incurred by the whole approach
_Py_get_current_thread_id())
• What’s special about _Py_get_current_thread_id()?
  • On Windows you could use GetCurrentThreadId()
  • On POSIX, pthread_self()
  • But that involves a syscall/libc trip
  • Is there a quicker way?
+#define _Py_ForgetReference(op) \
+    do { \
+        if (Py_PXCTX) \
+            _Px_ForgetReference(op); \
+        else \
+            _Py_INC_TPFREES(op); \
+        break; \
+    } while (0)
+
+#endif /* WITH_PARALLEL */
• So, overhead is reduced to a couple of extra instructions and an extra branch (the cost of which is minimized by branch prediction in most cases)
• That is basically nothing compared to STM or fine-grained locking
completes (either synchronously or asynchronously)
• What it doesn’t mean:
  • The other side definitely got it (they could have closed the connection)
• What it does mean:
  • All the data you tried to send successfully became bytes on a wire
  • The send socket buffer is empty
• What it implies:
  • You’re free to send more data if you’ve got it.
• Why it’s useful:
  • Eliminates the need for a producer/consumer relationship
    • i.e. pause_producing()/stop_consuming()
  • No need to buffer anything internally (see the sketch below)
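As a concrete illustration of the no-internal-buffering point, here is a hedged sketch of a protocol that simply hands back the next pre-computed chunk each time send_complete() fires. The initial_bytes_to_send()/send_complete() method names and the async.server()/register()/run() calls mirror the chargen example on the following slides; the “return None when there is nothing more to send” convention and the CHUNKS data are assumptions made for illustration.

import async

# Purely illustrative data: pre-computed chunks to stream to each client.
CHUNKS = [('chunk %d\n' % i).encode('ascii') for i in range(1000)]

class Streamer:
    def initial_bytes_to_send(self):
        # First send; send_complete() drives all subsequent sends.
        return CHUNKS[0]

    def send_complete(self, transport, send_id):
        # The previous send has hit the wire and the send buffer is empty,
        # so we are free to hand back more data -- no producer/consumer
        # pausing, no internal buffering.
        if send_id < len(CHUNKS):
            return CHUNKS[send_id]
        return None   # assumption: nothing more to send for now

server = async.server('10.211.55.3', 20019)
async.register(transport=server, protocol=Streamer)
async.run()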
def chargen(lineno, nchars=73):
    start = ord(' ')
    end = ord('~')
    c = lineno + start
    while c > end:
        c = (c % end) + start
    b = bytearray(nchars)
    for i in range(0, nchars-2):
        if c > end:
            c = start
        b[i] = c
        c += 1
    b[nchars-1] = ord('\n')
    return b
• No blocking calls
• Will happily consume all CPU when called in a loop (see below)
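For example, calling it directly shows both properties: each call is pure computation, so driving it in a tight loop will happily pin a core (the loop bounds here are arbitrary):

# Print the first few generated lines (as bytearrays)...
for lineno in range(3):
    print(chargen(lineno))

# ...then burn CPU by generating a large batch back to back -- there is
# nothing in this loop that can block.
for lineno in range(1000000):
    chargen(lineno)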
“Sending” is achieved by returning a “sendable” object:
• PyBytesObject
• PyByteArray
• PyUnicode
• Callable PyObject that returns one of the above

import async

class Chargen:
    def initial_bytes_to_send(self):
        return chargen(0)

    def send_complete(self, transport, send_id):
        return chargen(send_id)

server = async.server('10.211.55.3', 20019)
async.register(transport=server, protocol=Chargen)
async.run()
send_complete()
• ….which generates another send_complete()
• ….and so on
• It’s an “IO hog”
  • Always wants to send something when given the chance

import async

class Chargen:
    def initial_bytes_to_send(self):
        return chargen(0)

    def send_complete(self, transport, send_id):
        return chargen(send_id)

server = async.server('10.211.55.3', 20019)
async.register(transport=server, protocol=Chargen)
async.run()
sending 73 bytes at a time
• The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes)
• Good simulator of real-world conditions, where the CPU time to process a client request would dwarf the IO overhead of communicating the result back to the client
• With a default send socket buffer size of 8192 bytes and a local netcat client, you’re never going to block during send()
• Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)
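For reference, a rough stand-in for that local netcat client, using nothing but the standard socket module (the host/port are the ones from the chargen example above, and the running byte counter is just for illustration):

import socket

# Connect to the chargen server and keep draining whatever it sends;
# on the server side this keeps send_complete() firing continuously.
with socket.create_connection(('10.211.55.3', 20019)) as sock:
    total = 0
    while True:
        data = sock.recv(8192)
        if not data:
            break
        total += len(data)
        print('\rreceived %d bytes' % total, end='', flush=True)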
• “Why do we bother INCREF/DECREFing Py_None?!”
• Probably should have picked an easier “pet project” than parallelizing the interpreter
• I’m really happy with the progress so far, though: exploiting multiple cores without impacting single-threaded performance!
• Lots of work still to do
  • The “do X” part (the thread-safe alternatives to “do Y”) is a huge topic that I’m planning to tackle in subsequent presentations
  • How the async client/server stuff is implemented is a huge separate topic too
spent weeks debugging it, and come up with a solid solution
• I’ve got a temporary solution in place and a better long-term solution in mind
• I’ve thought about the problem; it can probably be exploited to instantly crash the interpreter, and I’m currently dealing with it by not doing that
  • “Doctor, it hurts when I do this.”
• I have no idea, and the very problem you speak of could threaten the viability of the entire approach