n milliseconds - frequent stacks are hot code paths - generally low-overhead - many third-party implementations for Python https://github.com/joerick/pyinstrument https://github.com/bdarnell/plop https://github.com/nylas/nylas-perftools https://github.com/what-studio/profiling etc. In-process profiling: two common strategies 12
traceback, starting from the interrupted stack frame stack = traceback.format_stack(frame) # record it somewhere record(stack) # Interpreter: Call sample() whenever we get SIGVTALRM! signal.signal(signal.SIGVTALRM, sample) # Operating system: Send the SIGVTALRM signal every 0.01 seconds! signal.setitimer(signal.ITIMER_VIRTUAL, 0.01, 0.01) A profiling workflow Sample stacks 13
traceback, starting from the interrupted stack frame stack = traceback.format_stack(frame) # record it somewhere record(stack) # Interpreter: Call sample() whenever we get SIGVTALRM! signal.signal(signal.SIGVTALRM, sample) # Operating system: Send the SIGVTALRM signal every 0.01 seconds! signal.setitimer(signal.ITIMER_VIRTUAL, 0.01, 0.01) A profiling workflow Sample stacks 14
traceback, starting from the interrupted stack frame stack = traceback.format_stack(frame) # record it somewhere record(stack) # Interpreter: Call sample() whenever we get SIGVTALRM! signal.signal(signal.SIGVTALRM, sample) # Operating system: Send the SIGVTALRM signal every 0.01 seconds! signal.setitimer(signal.ITIMER_VIRTUAL, 0.01, 0.01) A profiling workflow Sample stacks 15
traceback, starting from the interrupted stack frame stack = traceback.format_stack(frame) # record it somewhere record(stack) # Interpreter: Call sample() whenever we get SIGVTALRM! signal.signal(signal.SIGVTALRM, sample) # Operating system: Send the SIGVTALRM signal every 0.01 seconds! signal.setitimer(signal.ITIMER_VIRTUAL, 0.01, 0.01) A profiling workflow Sample stacks 16
#0 0x000000000047fe34 in lookdict_string (mp=0x7ffff7f64be8, key='True' hash=-1504860438477502276) at ../Objects/dictobject.c:430 #1 0x00000000005241b4 in PyEval_EvalFrameEx ( f=Frame 0x7ffff7f477c0, for file script.py, line 2, in f (), throwfl at ../Python/ceval.c:2107 #2 0x000000000052bcb7 in fast_function (func=<function at remote 0x7fff pp_stack=0x7fffffffdda0, n=0, na=0, nk=0) at ../Python/ceval.c:4119 #3 0x000000000052b9ca in call_function (pp_stack=0x7fffffffdda0, oparg= at ../Python/ceval.c:4054 #4 0x00000000005266ca in PyEval_EvalFrameEx ( f=Frame 0x7ffff7f475e8, for file script.py, line 6, in g (), throwfl at ../Python/ceval.c:2679 includes Python context (if gdb extensions are installed) Just run this in a loop to get a sample set. more: poormansprofiler.org 25
10Hz sample rate $ gdb -p $PID -ex bt --batch #0 0x000000000047fe34 in lookdict_string (mp=0x7ffff7f64be8, key='True' hash=-1504860438477502276) at ../Objects/dictobject.c:430 #1 0x00000000005241b4 in PyEval_EvalFrameEx ( f=Frame 0x7ffff7f477c0, for file script.py, line 2, in f (), throwfl at ../Python/ceval.c:2107 #2 0x000000000052bcb7 in fast_function (func=<function at remote 0x7fff pp_stack=0x7fffffffdda0, n=0, na=0, nk=0) at ../Python/ceval.c:4119 #3 0x000000000052b9ca in call_function (pp_stack=0x7fffffffdda0, oparg= at ../Python/ceval.c:4054 #4 0x00000000005266ca in PyEval_EvalFrameEx ( f=Frame 0x7ffff7f475e8, for file script.py, line 6, in g (), throwfl at ../Python/ceval.c:2679 Can we do better? 26
for dynamic instrumentation. - You write a script specifying probe points and handlers, e.g. probe nd_syscall.open { printf("%d %s\n", pid(), filename()) } // Every time a process opens a file, print the PID and filename - SystemTap compiles this into C and loads it as a kernel module. - Can be challenging to use, but very powerful. - less overhead for this use case than gdb, which uses the ptrace API. sourceware.org/systemtap/ 27
our Python process -e "probe perf.sw.task_clock.process # at every tick of its process clock { print_ubacktrace() }" # print a userspace backtrace Step 1: get C stacks 29
to build an execution timeline. These use a kernel facility called uprobes - basically an API to insert software breakpoints at a userspace instruction address, then execute a handler when the breakpoint gets hit. https://www.google.com/patents/US20080133896 50
long len, MYSQL_FIELD *field) { PyObject *v; if (converter != Py_None) { v = PyObject_CallFunction(converter, "s#", rowitem, (int)len)); } else { v = PyBytes_FromStringAndSize(rowitem, (int)len); } // ... if rowitem is a string-encoded float ("2.2718") Then this turns rowitem into a Python string, then calls float() on it Here, the database driver is turning raw character buffers from the database into Python objects. 59
unsigned long len, MYSQL_FIELD *field) { PyObject *v; - if (converter != Py_None) { + if (field_type == FIELD_TYPE_FLOAT) { + v = PyFloat_FromDouble(strtod(rowitem, NULL)); + } else if (converter != Py_None) { v = PyObject_CallFunction(converter, "s#", rowitem, (int)len); } else { v = PyBytes_FromStringAndSize(rowitem, (int)len); } // ... Result: ~15% more requests served per second Let's do that conversion in C 60
aggregate view of our application's hotspots, down to the execution profile of a single extension function, without needing to stop or change our application. 61
i in range(200)] dt = np.dtype({"names": names, "formats": 200 * [np.uint8]}) array = np.zeros((1000,), dt) sortedIndices = np.argsort(array, order='field0') The problem script "calling argsort() on a structured array creates a leak" We'll just let this run in the background while we trace it. 64
Whenever an object is allocated, keep track of the stack trace // and timestamp alloc_bt[$return] = sprint_merged_stack() alloc_time[$return] = gettimeofday_ms() } probe process.function("_PyObject_Free").return { // Clear that data when the object is freed delete alloc_bt[$p] delete alloc_time[$p] } Let's try tracing allocations 65
Whenever an object is allocated, keep track of the stack trace // and timestamp alloc_bt[$return] = sprint_merged_stack() alloc_time[$return] = gettimeofday_ms() } probe process.function("_PyObject_Free").return { // Clear that data when the object is freed delete alloc_bt[$p] delete alloc_time[$p] } Let's try tracing allocations 66
Whenever an object is allocated, keep track of the stack trace // and timestamp alloc_bt[$return] = sprint_merged_stack() alloc_time[$return] = gettimeofday_ms() } probe process.function("_PyObject_Free").return { // Clear that data when the object is freed delete alloc_bt[$p] delete alloc_time[$p] } Let's try tracing allocations 67
type_call arraydescr_new PyArray_DescrConverter _convert_from_dict PySequence_Tuple PyList_AsTuple PyTuple_New _PyObject_GC_NewVar Let's run the test case in the background, and trace for ~30 seconds. If we take the output and graph "surviving allocations over time" ... okay, looks leaky. 68
type_call arraydescr_new PyArray_DescrConverter _convert_from_dict PySequence_Tuple PyList_AsTuple PyTuple_New _PyObject_GC_NewVar This is in C extension code -- likely caused by a missing call to Py_DECREF. Can we figure out where that call should go? 69
i in range(200)] # find a leaky allocation here dt = np.dtype({"names": names, "formats": 200 * [np.uint8]}) array = np.zeros((1000,), dt) # track its refcount as this function executes sortedIndices = np.argsort(array, order='field0') The problem script "calling argsort() on a structured array creates a leak" 70
global ptr=0, refcount=0 global problem_stack = "start;__libc_start_main;..." probe process.function("_PyObject_Malloc").return { if (sprint_merged_stack() == problem_stack && ptr == 0) { printf("Saving pointer %p\n", $return) ptr = $return } } When we see a leaky allocation, keep track of that pointer 71
(ptr != 0) { object_ptr = &@cast(ptr, "PyGC_Head", @PY)[1] new_refcount = @cast(object_ptr, "PyObject", @PY)->ob_refcnt if (new_refcount != refcount) { refcount = new_refcount; printf("New refcount: %d at %s\n", refcount, pp()) } } } check if the object's refcount has changed statement probe: at each line of the suspicious function 72
New refcount: 2 at process("...").statement("array_argsort@...:1274") New refcount: 3 at process("...").statement("array_argsort@...:1238") Function entry point suspicious 73
tool. Other things you could do: - Show a live latency histogram for a particular function in your code - Correlate your application's behavior with network or other system activity - Whatever you come up with. 76
instrumentation of running programs - Can show lots of low-level detail - Language-agnostic: can be applied to debug other parts of your stack as well. Cons: - Steep learning curve - Some initial setup cost 77
best practices, wrapper scripts (github.com/emfree/systemtap-python-tools) - standalone/userspace tools for out-of-process profiling, etc. (github.com/eklitzke/pystack, github.com/jvns/ruby-stacktrace) - interpreter support for external tracers 80
firsthand experience :( - github.com/vmprof/vmprof-python was built to profile PyPy, including JIT introspection. - Linux `perf` supports JITted runtimes like Node.js and Java if they write a symbol map to /tmp/perf-PID.map. PyPy could do that too. 83