Slide 1

Slide 1 text

Improving debuggability of complex asyncio applications Joongi Kim [⾦駿起] / Lablup Inc. PyCon US & APAC 2023

Slide 2

Slide 2 text

Improving debuggability of complex asyncio applications Joongi Kim [⾦駿起] / Lablup Inc. PyCon US & APAC 2023

Slide 3

Slide 3 text

About me • PyCon KR/JP/US/APAC speaker • Seoul, South Korea • CTO & Co-founder of Lablup Inc. • PhD in Computer Science (packet processing acceleration using DPDK and GPUs & Xeon Phi) • A humble asyncio ecosystem contributor • Hobby pianist APAC 2016 in Seoul US 2023 in Salt Lake City JP 2022 in Tokyo

Slide 4

Slide 4 text

About Lablup • Co-founded in 2015 • Key product: – MLDev/MLOps platform with highlights on GPU resource mgmt. – Powering 70+ enterprise customers (incl. a largest GPU cluster in Latin America) on 3 continents with Python asyncio! Used asyncio since 2014 (Python 3.4) / ~70K LoC Sibling open source projects: aiotools, Callosum, aiomonitor-ng Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio – Model: Open core + proprietary enterprise plugins & tech support https://github.com/lablup/backend.ai

Slide 5

Slide 5 text

Updates after PyCon US • aiomonitor v0.5 (Jul 21st) – Backported aiomonitor-ng – I became the new maintainer! • aiomonitor v0.6 (Aug 27th) – GUI made with htmx + alpine.js + tailwind.css – Reference Breaking Down Barriers: A Unified Approach to Web Development with django-htmx Track 2, Bae KwonHan

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Challenges in debugging asyncio apps Improving debuggability of complex asyncio applications

Slide 8

Slide 8 text

Production debugging • The hard problems – Memory leak / CPU hogging – Unexpected termination of critical long-running tasks – Loss of vital error logs – When these are combined with 3rd party codes that I have no control... – When these happen in remote, air-gapped clusters... • How to handle these in asyncio-based apps?

Slide 9

Slide 9 text

Solutions for “hard" problems Prevention via structured concurrency[1] Debugging via postmortem asyncio.TaskGroup aio-libs/{aiomonitor, aioconsole} asyncio.all_tasks() Debugging via live inspection achimnol/aiomonitor-ng Task termination chains achimnol/aiomonitor-ng Task creation chains asyncio 3rd parties asyncio.Supervisor asyncio.as_completed_safe() trio.open_nursery() [1] https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ aiotools.PersistentTaskGroup aiotools.as_completed_safe() ? ? Traceback & logging N/A

Slide 10

Slide 10 text

Traceback in asyncio • Python traceback gives: – The type and arguments of the exception – The location where the exception is raised in the stack • Are these sufficient? – What if a 3rd-party code swallows the cancellation? – What if a task exits with an error but its garbage collection is delayed (e.g., due to extra references from long-lived objects)? File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError

Slide 11

Slide 11 text

Traceback in asyncio • We also need to know: – Who cancelled this task? e.g., Where did my precious task go?! – Who created this task? e.g., When there are an unexpectedly large number of tasks... File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError

Slide 12

Slide 12 text

The world of async tasks My Code 3rd Party Code My Code 3rd Party Code My Code My Code task stack Error! Swallows cancellation? raise cancel() await ... ... My Code 3rd Party Code create_task() await await An asyncio app is a giant tree of multiple tasks with their own stacks, interacting with each other. callback task stack

Slide 13

Slide 13 text

aiomonitor v0.5 Improving debuggability of complex asyncio applications

Slide 14

Slide 14 text

A brief architecture of aiomonitor asyncio Application A Separate Terminal App Thread aiomonitor Thread Telnet client asyncio event loop telnet ui loop REPL with start_monitor():

Slide 15

Slide 15 text

Quick start import aiomonitor async def main(): loop = asyncio.get_running_loop() run_forever = loop.create_future() with aiomonitor.start_monitor(loop): await run_forever try: asyncio.run(main()) except KeyboardInterrupt: pass $ pip install aiomonitor-ng $ edit test.py $ python test.py & $ telnet localhost 50101

Slide 16

Slide 16 text

Quick start monitor >>> help Usage: help [OPTIONS] COMMAND [ARGS]... To see the usage of each command, run them with "--help" option. Commands: cancel Cancel an indicated task console Switch to async Python REPL exit (q,quit) Leave the monitor client session help (?,h) Show the list of commands ps (p) Show task table ps-terminated (pst,pt) List recently terminated/cancelled tasks signal Send a Unix signal stacktrace (st,stack) Print a stack trace from the event loop thread where (w) Show stack frames and the task creation chain of a task where-terminated (wt) Show stack frames and the termination/cancellation chain of a task

Slide 17

Slide 17 text

Additions in aiomonitor-ng • The task creation chain tracker • The task termination tracker • The task cancellation chain tracker • prompt_toolkit + Click integration for rich CLI

Slide 18

Slide 18 text

Task creation chain tracker • Q: Who created this task and its parents? • Implementation – class TracedTrask(asyncio.Task) Thanks to asyncio.set_task_factory() – Store weakref of tasks in a bookkeeper – Traverse through the chain of weakrefs upon where command • Assumption – Parent tasks are alive while child tasks are alive. (Lifetime inclusion) Structured concurrency and asyncio.TaskGroup encourage this. It makes easier to prevent memory leak in the creation chain tracker.

Slide 19

Slide 19 text

Task C Task B Task A await ...() ...() create_task(...) await ...() create_task(...) ...() await ...() await ...() create_task(...) Task D await ...() await ...() weak keys ...() Task A Task B Task C Task D _created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task]) Task creation chain tracker _created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary]) Task A Task B Task B stack A stack B stack C stack D weak values

Slide 20

Slide 20 text

Task termination tracker • Q. Why is this task terminated (even it was a long before...)? • Implementation – Register _trace_termination() as the task done callback – get_trace_id(): Task object's ID is no longer valid after gc. Combine task ID and name (default: Task-XXX) to mangle a unique identifier to refer in the termination logs – Prevent memory leak by limiting the maximum length of history – Add @preserve_termination_logs decorator to mark specific coroutine task's termination log forever

Slide 21

Slide 21 text

Task cancellation chain tracker • Q: Who cancelled this task, which also may be cancelled by another? • Implementation – Override Task.cancel() to store CancellationChain current_task() in cancel() is the canceller task, while self in cancel() is the cancellee task. We can extract the canceller's stack trace at this point. – Deligate postprocessing to the monitor UI thread which holds the tracker objects. Uses janus for sync/async queue interfacing. Removes tracker objects of cancelled/terminated tasks when necessary (e.g., reached the history limit)

Slide 22

Slide 22 text

Task B Task X await ...() ...() task_b.cancel(...) await ...() _termination_info_queue: janus.Queue[TerminatedTaskInfo] Task termination & cancellation chain tracker _cancellation_chain_queue: janus.Queue[CancellationChain] Monitor UI thread App thread await ...() canceller cancellee cancel() extract_stack(task) filter_stack(task) cleanup_trackers() task done callback TracedTask.cancel()

Slide 23

Slide 23 text

prompt_toolkit + Click integration • aiomonitor – Multiple concurrent telnet sessions (resolving aio-libs/aiomonitor#140) – Auto-completion of commands and task IDs / filtering the task list • Implementation – Combines several contextvars to keep track of: The monitor instance The current stdout buffer linked with the current telnet connection – Signals Click command completion with asyncio.Event – Defines a custom ClickCompleter adapter

Slide 24

Slide 24 text

Live task list (abbrev. of long paths)

Slide 25

Slide 25 text

Filtering the live task list by coro/task names List of terminated / cancelled tasks

Slide 26

Slide 26 text

A part of task creation chain trace showing that a 3rd party library (grpcio) creates internal sub-tasks ← my code ← 3rd party code

Slide 27

Slide 27 text

aiomonitor v0.6 Improving debuggability of complex asyncio applications

Slide 28

Slide 28 text

Technical Goals of Web UI in aiomonitor • Rewritable by backend developers – The core logic should live at the server-side code. – Let's use a "good default" design framework! 🌞 • Lightweight dev setup – No additional build chain for Javascript (e.g., transpilers, bundlers, etc.) – Changed codes should just work if I reload the page. • Relatively small download size (under 1 MB) • Clean separation of UI and core logic

Slide 29

Slide 29 text

v0.5 Architecture asyncio Application A Separate Terminal App Thread aiomonitor Thread Telnet client telnet ui loop REPL with start_monitor(): .monitor .task asyncio event loop .telnet .monitor

Slide 30

Slide 30 text

v0.6: Restructuring frontends asyncio Application A Separate Terminal App Thread aiomonitor Thread Telnet client asyncio event loop ui loop REPL with start_monitor(): .termui .webui .monitor .task .telnet Web Browser htmx mustache.js alpine.js tailwind.css 127 KB 365 KB (non-gzipped)

Slide 31

Slide 31 text

Demo Improving debuggability of complex asyncio applications

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Case studies Improving debuggability of complex asyncio applications

Slide 35

Slide 35 text

Case: Backend.AI “hangs" • Symptom: No error logs, but no longer user requests processed some time after startup. • Analysis: Agent's process_lifecycle_events() task to receive Docker's container status updates disappears... – Mis-used asyncio.TaskGroup to keep track of multiple fired tasks along a long-running loop, which caused termination of the entire group & loop upon any unhandled child exception This became the motivation behind aiotools.PersistentTaskGroup as well • aiomonitor helped analysis with the task termination logs.

Slide 36

Slide 36 text

Case: Backend.AI “hogs the CPU" • Symptom: Slow-down and eventual hang up. Happened in a specific customer site only. • Analysis: One user has put too many files in a storage folder (more than millions) which was periodically scanned to get the disk usage and file count recursively. – The latency of a scan task became longer than the interval. – The tasks were indefinitely accumulated (observed up to 6,000), competing for the same executor pool for synchronous I/O. • aiomonitor helped analysis with the live task monitoring.

Slide 37

Slide 37 text

Case: Hidden bomb inside a 3rd party dependency • Symptom: Memory leak after upgrading grpcio to a specific version • Analysis: There was an internal bug in grpcio when using the secure gRPC channel to acess etcd with explicit credentials, not cleaning up a specific callback task. – grpc/grpc#25449 – There were too many dangling tasks, which were generated as an internal callback of grpcio. • aiomonitor helped analysis with the task creation chain tracker.

Slide 38

Slide 38 text

Summary Improving debuggability of complex asyncio applications

Slide 39

Slide 39 text

Lessons • Debuggability = Observability + Reproducibility – Live inspection & Logging • asyncio needs a unique tooling for debugging! • We should continuously invest at improving our toolchains for debugging. – There may seem no visible direct returns, but when bad things happen, it will save your time.

Slide 40

Slide 40 text

Future Work • Top-like activity monitor – e.g., most/least recently yielded tasks – Requires more decent support of tracing in asyncio... • Adopting subinterpreters for better reliability • Performance optimization & better pagination • Scoped live task view – A structured async app = a tree of task groups – List of persistent task groups (or supervisors) – Per-taskgroup task list view

Slide 41

Slide 41 text

Side Work • Expanding structured concurrency to long-running task groups – "TaskScope" / "Supervisor" (experimenting in aiotools) – ExceptionGroup → Exception Generator? (asyncio.as_completed) • Safe synchronous cancellation of tasks – https://github.com/python/cpython/issues/103486 – Graceful shutdown with timeouts? • Still under discussion & exploration with some CPython core devs...

Slide 42

Slide 42 text