• CTO & Co-founder of Lablup Inc.
• PhD in Computer Science (packet processing acceleration using DPDK and GPUs & Xeon Phi)
• A humble asyncio ecosystem contributor
• Hobby pianist
APAC 2016 in Seoul / US 2023 in Salt Lake City / JP 2022 in Tokyo
MLDev/MLOps platform with highlights on GPU resource mgmt.
– Powering 70+ enterprise customers (incl. one of the largest GPU clusters in Latin America) on 3 continents with Python asyncio!
  Used asyncio since 2014 (Python 3.4) / ~70K LoC
  Sibling open source projects: aiotools, Callosum, aiomonitor-ng
  Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio
– Model: Open core + proprietary enterprise plugins & tech support
https://github.com/lablup/backend.ai
Backported aiomonitor-ng
– I became the new maintainer!
• aiomonitor v0.6 (Aug 27th)
– GUI made with htmx + alpine.js + tailwind.css
– Reference: "Breaking Down Barriers: A Unified Approach to Web Development with django-htmx" (Track 2, Bae KwonHan)
– CPU hogging
– Unexpected termination of critical long-running tasks
– Loss of vital error logs
– When these are combined with 3rd-party code that I have no control over...
– When these happen in remote, air-gapped clusters...
• How to handle these in asyncio-based apps?
and arguments of the exception
– The location where the exception is raised in the stack
• Are these sufficient?
– What if 3rd-party code swallows the cancellation?
– What if a task exits with an error but its garbage collection is delayed (e.g., due to extra references from long-lived objects)?

  File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2
    await asyncio.sleep(100)
  File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError
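The "swallowed cancellation" question can be reproduced in a few lines. This is only a minimal sketch: `third_party_call` is a hypothetical stand-in for library code outside your control, not something taken from the talk.

```python
import asyncio


async def third_party_call():
    # Hypothetical 3rd-party coroutine that catches CancelledError
    # and carries on instead of re-raising it.
    try:
        await asyncio.sleep(100)
    except asyncio.CancelledError:
        pass  # the cancellation is swallowed here
    return "finished normally"


async def main():
    task = asyncio.create_task(third_party_call())
    await asyncio.sleep(0)   # let the task start and block in sleep()
    task.cancel()            # request cancellation
    result = await task      # no CancelledError reaches the caller
    return task.cancelled(), result


if __name__ == "__main__":
    print(asyncio.run(main()))
```

From the caller's side the task looks like it completed successfully, so neither the traceback nor `task.cancelled()` reveals that a cancellation was requested.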
– Who cancelled this task? (e.g., "Where did my precious task go?!")
– Who created this task? (e.g., when there is an unexpectedly large number of tasks...)

  File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2
    await asyncio.sleep(100)
  File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError
[Figure: task stacks made of "My Code" and "3rd Party Code" frames, connected by create_task(), await, callback, cancel(), and raise; an "Error!" is raised in one stack and a 3rd-party frame is marked "Swallows cancellation?"]
An asyncio app is a giant tree of multiple tasks with their own stacks, interacting with each other.
To see the usage of each command, run them with "--help" option.

Commands:
  cancel                  Cancel an indicated task
  console                 Switch to async Python REPL
  exit (q,quit)           Leave the monitor client session
  help (?,h)              Show the list of commands
  ps (p)                  Show task table
  ps-terminated (pst,pt)  List recently terminated/cancelled tasks
  signal                  Send a Unix signal
  stacktrace (st,stack)   Print a stack trace from the event loop thread
  where (w)               Show stack frames and the task creation chain of a task
  where-terminated (wt)   Show stack frames and the termination/cancellation chain of a task
and its parents?
• Implementation
– class TracedTask(asyncio.Task)
  Installed thanks to loop.set_task_factory()
– Store weakrefs of tasks in a bookkeeper
– Traverse the chain of weakrefs upon the where command
• Assumption
– Parent tasks are alive while child tasks are alive. (Lifetime inclusion)
  Structured concurrency and asyncio.TaskGroup encourage this.
  This makes it easier to prevent memory leaks in the creation chain tracker.
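The bookkeeping described above can be sketched roughly as follows. This is not aiomonitor's actual code: `_parents`, `tracing_task_factory`, and `creation_chain` are hypothetical names, and the sketch uses plain `asyncio.Task` objects rather than a subclass.

```python
import asyncio
import weakref

# Hypothetical bookkeeper: child task (weak key) -> weakref to its creator task.
_parents = weakref.WeakKeyDictionary()


def tracing_task_factory(loop, coro, **kwargs):
    """Create a normal task and remember which task was running at creation."""
    task = asyncio.Task(coro, loop=loop, **kwargs)
    parent = asyncio.current_task(loop)  # None when no task is running
    if parent is not None:
        _parents[task] = weakref.ref(parent)
    return task


def creation_chain(task):
    """Walk from a task up through its creators, as a `where`-like command might."""
    chain = [task]
    while (ref := _parents.get(chain[-1])) is not None:
        parent = ref()
        if parent is None:  # the parent has already been garbage-collected
            break
        chain.append(parent)
    return chain


async def leaf():
    await asyncio.sleep(0)
    return [t.get_name() for t in creation_chain(asyncio.current_task())]


async def middle():
    return await asyncio.create_task(leaf(), name="leaf")


async def main():
    asyncio.get_running_loop().set_task_factory(tracing_task_factory)
    return await asyncio.create_task(middle(), name="middle")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because both the keys and the stored parent references are weak, the tracker itself never keeps a task alive; the chain simply ends early if a parent is gone, which is why the lifetime-inclusion assumption matters.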
Task creation chain tracker
[Figure: Tasks A to D, each with its own stack of await ...() frames (stack A to stack D); create_task(...) calls inside one task's stack spawn other tasks. The tasks are the weak keys of both mappings, and the chain mapping holds its values weakly.]
_created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task]]
_created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary]]
(even if it was long ago...)?
• Implementation
– Register _trace_termination() as the task done callback
– get_trace_id(): a Task object's ID is no longer valid after gc.
  Combine the task ID and name (default: Task-XXX) into a unique identifier to refer to in the termination logs
– Prevent memory leaks by limiting the maximum length of the history
– Add the @preserve_termination_logs decorator to keep the termination logs of specific coroutine tasks forever
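A rough illustration of the done-callback approach with a bounded history follows. It is a sketch under assumptions: `MAX_HISTORY`, `_terminated`, `trace_id`, and `spawn` are hypothetical names, and the identifier format is only one way to combine the task ID and name.

```python
import asyncio
from collections import deque

MAX_HISTORY = 3  # hypothetical limit; keeps the log from growing without bound
_terminated = deque(maxlen=MAX_HISTORY)


def trace_id(task):
    # id() may be reused after garbage collection, so pair it with the name.
    return f"{id(task):x}:{task.get_name()}"


def _trace_termination(task):
    """Done callback: record how the task ended."""
    if task.cancelled():
        outcome = "cancelled"
    elif task.exception() is not None:
        outcome = f"failed: {task.exception()!r}"
    else:
        outcome = "finished"
    _terminated.append((trace_id(task), outcome))


def spawn(coro, name):
    task = asyncio.create_task(coro, name=name)
    task.add_done_callback(_trace_termination)
    return task


async def ok():
    return 1


async def boom():
    raise ValueError("boom")


async def main():
    tasks = [spawn(ok(), "a"), spawn(boom(), "b"),
             spawn(ok(), "c"), spawn(asyncio.sleep(10), "d")]
    tasks[-1].cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    # Strip the numeric id so the result is easy to read.
    return [(tid.split(":", 1)[1], outcome) for tid, outcome in _terminated]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Four tasks terminate but only the three most recent entries survive, because `deque(maxlen=...)` silently drops the oldest record.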
which also may be cancelled by another?
• Implementation
– Override Task.cancel() to store the CancellationChain
  current_task() in cancel() is the canceller task, while self in cancel() is the cancellee task.
  We can extract the canceller's stack trace at this point.
– Delegate postprocessing to the monitor UI thread, which holds the tracker objects.
  Uses janus for sync/async queue interfacing.
  Removes tracker objects of cancelled/terminated tasks when necessary (e.g., when the history limit is reached)
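The canceller/cancellee distinction can be shown with a small `asyncio.Task` subclass. This sketch only captures the idea: `CancelTracingTask`, `_cancelled_by`, and the helper coroutines are hypothetical, and it records the result in-process instead of handing it to a separate monitor thread.

```python
import asyncio
import traceback
import weakref

# Hypothetical record: cancellee task -> (canceller name, canceller stack).
_cancelled_by = weakref.WeakKeyDictionary()


class CancelTracingTask(asyncio.Task):
    def cancel(self, msg=None):
        # `self` is the cancellee; the currently running task is the canceller.
        try:
            canceller = asyncio.current_task()
        except RuntimeError:  # cancel() called while no event loop is running
            canceller = None
        name = canceller.get_name() if canceller is not None else "<no task>"
        # Drop the last frame (this method) to keep only the caller's stack.
        _cancelled_by[self] = (name, traceback.extract_stack()[:-1])
        return super().cancel(msg)


def factory(loop, coro, **kwargs):
    return CancelTracingTask(coro, loop=loop, **kwargs)


async def victim():
    await asyncio.sleep(100)


async def kill_it(target):
    target.cancel()


async def main():
    asyncio.get_running_loop().set_task_factory(factory)
    target = asyncio.create_task(victim(), name="victim")
    await asyncio.sleep(0)  # let the victim start
    await asyncio.create_task(kill_it(target), name="killer")
    try:
        await target
    except asyncio.CancelledError:
        pass
    name, stack = _cancelled_by[target]
    return name, stack[-1].name  # canceller task and the function that called cancel()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The stack is captured synchronously inside `cancel()` because that is the only moment when the canceller's frames are still on the call stack.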
sessions (resolving aio-libs/aiomonitor#140)
– Auto-completion of commands and task IDs / filtering the task list
• Implementation
– Combines several contextvars to keep track of:
  The monitor instance
  The current stdout buffer linked with the current telnet connection
– Signals Click command completion with asyncio.Event
– Defines a custom ClickCompleter adapter
backend developers
– The core logic should live in the server-side code.
– Let's use a "good default" design framework! 🌞
• Lightweight dev setup
– No additional build chain for JavaScript (e.g., transpilers, bundlers, etc.)
– Changed code should just work when I reload the page.
• Relatively small download size (under 1 MB)
• Clean separation of UI and core logic
longer user requests processed some time after startup.
• Analysis: The agent's process_lifecycle_events() task, which receives Docker's container status updates, disappears...
– Misused asyncio.TaskGroup to keep track of multiple tasks fired from a long-running loop, which caused termination of the entire group & loop upon any unhandled child exception
  This became the motivation behind aiotools.PersistentTaskGroup as well
• aiomonitor helped the analysis with the task termination logs.
hang up. Happened at a specific customer site only.
• Analysis: One user had put too many files (millions) in a storage folder, which was periodically scanned recursively to get the disk usage and file count.
– The latency of a scan task became longer than the scan interval.
– The tasks accumulated indefinitely (observed up to 6,000), competing for the same executor pool for synchronous I/O.
• aiomonitor helped the analysis with the live task monitoring.
Memory leak after upgrading grpcio to a specific version
• Analysis: There was an internal bug in grpcio when using the secure gRPC channel to access etcd with explicit credentials: a specific callback task was not cleaned up.
– grpc/grpc#25449
– There were too many dangling tasks, which were generated as internal callbacks of grpcio.
• aiomonitor helped the analysis with the task creation chain tracker.
& Logging
• asyncio needs its own tooling for debugging!
• We should continuously invest in improving our toolchains for debugging.
– There may seem to be no visible direct returns, but when bad things happen, it will save you time.
yielded tasks
– Requires better tracing support in asyncio...
• Adopting subinterpreters for better reliability
• Performance optimization & better pagination
• Scoped live task view
– A structured async app = a tree of task groups
– List of persistent task groups (or supervisors)
– Per-taskgroup task list view