Upgrade to Pro — share decks privately, control downloads, hide ads and more …

[PyCon US 2023] Improving debuggability of comp...

[PyCon US 2023] Improving debuggability of complex asyncio applications

Joongi Kim

April 29, 2023
Tweet

More Decks by Joongi Kim

Other Decks in Programming

Transcript

  1. About me • PyCon KR/JP/APAC speaker (first in PyCon US!)

    • Seoul, South Korea • CTO & Co-founder of Lablup Inc. • PhD in Computer Science (packet processing acceleration using DPDK and GPUs) • A humble asyncio ecosystem contributor • Hobby pianist
  2. About Lablup • Co-founded in 2015 • Key product: –

    MLDev/MLOps platform with highlights on GPU resource mgmt. – Powering 70+ enterprise customers (incl. a largest GPU cluster in Latin America) on 3 continents with Python asyncio! Used asyncio since 2014 (Python 3.4) / ~65K LoC Sibling open source projects: aiotools, Callosum, aiomonitor-ng Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio – Model: Open core + proprietary enterprise plugins & tech support https://github.com/lablup/backend.ai
  3. Production debugging • The hard problems – Memory leak /

    CPU hogging – Unexpected termination of critical long-running tasks – Loss of vital error logs – When these are combined with 3rd party codes that I have no control... • How to handle these in asyncio-based apps?
  4. Solutions for “hard" problems Prevention via structured concurrency[1] Debugging via

    postmortem asyncio.TaskGroup aio-libs/{aiomonitor, aioconsole} asyncio.all_tasks() Debugging via live inspection achimnol/aiomonitor-ng Task termination chains achimnol/aiomonitor-ng Task creation chains asyncio 3rd parties asyncio.Supervisor asyncio.as_completed_safe() trio.open_nursery() [1] https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ aiotools.PersistentTaskGroup aiotools.as_completed_safe() ? ? Traceback & logging N/A
  5. Traceback in asyncio • Python traceback gives: – The type

    and arguments of the exception – The location where the exception is raised in the stack • Are these sufficient? – What if a 3rd-party code swallows the cancellation? – What if a task exits with an error but its garbage collection is delayed (e.g., due to extra references from long-lived objects)? File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
  6. Traceback in asyncio • We also need to know: –

    Who cancelled this task? e.g., Where did my precious task go?! – Who created this task? e.g., When there are unexpectedly many number of tasks... File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
  7. The world of async tasks My Code 3rd Party Code

    My Code 3rd Party Code My Code My Code task stack Error! Swallows cancellation? raise cancel() await ... ... My Code 3rd Party Code create_task() await await An asyncio app is a giant tree of multiple tasks with their own stacks, interacting with each other. callback task stack
  8. A brief architecture of aiomonitor asyncio Application A Separate Terminal

    App Thread aiomonitor Thread Telnet client asyncio event loop telnet ui loop REPL with start_monitor():
  9. Quick start import aiomonitor async def main(): loop = asyncio.get_running_loop()

    run_forever = loop.create_future() with aiomonitor.start_monitor(loop): await run_forever try: asyncio.run(main()) except KeyboardInterrupt: pass $ pip install aiomonitor-ng $ edit test.py $ python test.py & $ telnet localhost 50101
  10. Quick start monitor >>> help Usage: help [OPTIONS] COMMAND [ARGS]...

    To see the usage of each command, run them with "--help" option. Commands: cancel Cancel an indicated task console Switch to async Python REPL exit (q,quit) Leave the monitor client session help (?,h) Show the list of commands ps (p) Show task table ps-terminated (pst,pt) List recently terminated/cancelled tasks signal Send a Unix signal stacktrace (st,stack) Print a stack trace from the event loop thread where (w) Show stack frames and the task creation chain of a task where-terminated (wt) Show stack frames and the termination/cancellation chain of a task
  11. Additions in aiomonitor-ng • The task creation chain tracker •

    The task termination tracker • The task cancellation chain tracker • prompt_toolkit + Click integration for rich CLI
  12. Task creation chain tracker • Q: Who created this task

    and its parents? • Implementation – class TracedTrask(asyncio.Task) Thanks to asyncio.set_task_factory() – Store weakref of tasks in a bookkeeper – Traverse through the chain of weakrefs upon where command • Assumption – Parent tasks are alive while child tasks are alive. (Lifetime inclusion) Structured concurrency and asyncio.TaskGroup encourage this. It makes easier to prevent memory leak in the creation chain tracker.
  13. Task C Task B Task A await ...() ...() create_task(...)

    await ...() create_task(...) ...() await ...() await ...() create_task(...) Task D await ...() await ...() weak keys ...() Task A Task B Task C Task D _created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task]) Task creation chain tracker _created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary]) Task A Task B Task B stack A stack B stack C stack D weak values
  14. Task termination tracker • Q. Why is this task terminated

    (even it was a long before...)? • Implementation – Register _trace_termination() as the task done callback – get_trace_id(): Task object's ID is no longer valid after gc. Combine task ID and name (default: Task-XXX) to mangle a unique identifier to refer in the termination logs – Prevent memory leak by limiting the maximum length of history – Add @preserve_termination_logs decorator to mark specific coroutine task's termination log forever
  15. Task cancellation chain tracker • Q: Who cancelled this task,

    which also may be cancelled by another? • Implementation – Override Task.cancel() to store CancellationChain current_task() in cancel() is the canceller task, while self in cancel() is the cancellee task. We can extract the canceller's stack trace at this point. – Deligate postprocessing to the monitor UI thread which holds the tracker objects. Uses janus for sync/async queue interfacing. Removes tracker objects of cancelled/terminated tasks when necessary (e.g., reached the history limit)
  16. Task B Task X await ...() ...() task_b.cancel(...) await ...()

    _termination_info_queue: janus.Queue[TerminatedTaskInfo] Task termination & cancellation chain tracker _cancellation_chain_queue: janus.Queue[CancellationChain] Monitor UI thread App thread await ...() canceller cancellee cancel() extract_stack(task) filter_stack(task) cleanup_trackers() task done callback TracedTask.cancel()
  17. prompt_toolkit + Click integration • aiomonitor-ng – Multiple concurrent telnet

    sessions (resolving aio-libs/aiomonitor#140) – Auto-completion of commands and task IDs / filtering the task list • Implementation – Combines several contextvars to keep track of: The monitor instance The current stdout buffer linked with the current telnet connection – Signals Click command completion with asyncio.Event – Defines a custom ClickCompleter adapter
  18. A part of task creation chain trace showing that a

    3rd party library (grpcio) creates internal sub-tasks ← my code ← 3rd party code
  19. Case: Backend.AI “hangs" • Symptom: No error logs, but no

    longer user requests processed some time after startup. • Analysis: Agent's process_lifecycle_events() task to receive Docker's container status updates disappears... – Mis-used asyncio.TaskGroup to keep track of multiple fired tasks along a long-running loop, which caused termination of the entire group & loop upon any unhandled child exception This became the motivation behind aiotools.PersistentTaskGroup as well • aiomonitor-ng helped analysis with the task termination logs.
  20. Case: Backend.AI “hogs the CPU" • Symptom: Slow-down and eventual

    hang up. Happened in a specific customer site only. • Analysis: One user has put too many files in a storage folder (more than millions) which was periodically scanned to get the disk usage and file count recursively. – The latency of a scan task became longer than the interval. – The tasks were indefinitely accumulated (observed up to 6,000), competing for the same executor pool for synchronous I/O. • aiomonitor-ng helped analysis with the live task monitoring.
  21. Case: Hidden bomb inside a 3rd party dependency • Symptom:

    Memory leak after upgrading grpcio to a specific version • Analysis: There was an internal bug in grpcio when using the secure gRPC channel to acess etcd with explicit credentials, not cleaning up a specific callback task. – grpc/grpc#25449 – There were too many dangling tasks, which were generated as an internal callback of grpcio. • aiomonitor helped analysis with the task creation chain tracker.
  22. Lessons • Debuggability = Observability + Reproducibility – Live inspection

    & Logging • asyncio needs a unique tooling for debugging! • We should continuously invest at improving our toolchains for debugging. – There may seem no visible direct returns, but when bad things happen, it will save your time.
  23. Future Work • Scoped live task view – A structured

    async app = a tree of task groups – List of persistent task groups (or supervisors) – Per-taskgroup task list view • Top-like activity monitor (e.g., most/least recently yielded tasks) • Adopting subinterpreters for better reliability • Performance optimization / Better pagination – Especially when there are thousands of tasks...