$30 off During Our Annual Pro Sale. View Details »

[PyCon US 2023] Improving debuggability of complex asyncio applications

[PyCon US 2023] Improving debuggability of complex asyncio applications

Joongi Kim

April 29, 2023
Tweet

More Decks by Joongi Kim

Other Decks in Programming

Transcript

  1. Improving debuggability
    of complex asyncio
    applications
    Joongi Kim [dzungi kim] / Lablup Inc.
    PyCon US 2023

    View Slide

  2. About me
    • PyCon KR/JP/APAC speaker
    (first in PyCon US!)
    • Seoul, South Korea
    • CTO & Co-founder of Lablup Inc.
    • PhD in Computer Science
    (packet processing acceleration
    using DPDK and GPUs)
    • A humble asyncio ecosystem
    contributor
    • Hobby pianist

    View Slide

  3. About Lablup
    • Co-founded in 2015
    • Key product:
    – MLDev/MLOps platform with highlights on GPU resource mgmt.
    – Powering 70+ enterprise customers (incl. a largest GPU cluster in
    Latin America) on 3 continents with Python asyncio!
    Used asyncio since 2014 (Python 3.4) / ~65K LoC
    Sibling open source projects: aiotools, Callosum, aiomonitor-ng
    Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio
    – Model: Open core + proprietary enterprise plugins & tech support
    https://github.com/lablup/backend.ai

    View Slide

  4. Challenges in debugging asyncio apps
    Improving debuggability of complex asyncio applications

    View Slide

  5. Production debugging
    • The hard problems
    – Memory leak / CPU hogging
    – Unexpected termination of critical long-running tasks
    – Loss of vital error logs
    – When these are combined with 3rd party codes that I have no
    control...
    • How to handle these in asyncio-based apps?

    View Slide

  6. Solutions for “hard" problems
    Prevention
    via structured
    concurrency[1]
    Debugging via
    postmortem
    asyncio.TaskGroup
    aio-libs/{aiomonitor, aioconsole}
    asyncio.all_tasks()
    Debugging via
    live inspection
    achimnol/aiomonitor-ng
    Task termination chains
    achimnol/aiomonitor-ng
    Task creation chains
    asyncio
    3rd parties
    asyncio.Supervisor
    asyncio.as_completed_safe()
    trio.open_nursery()
    [1] https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/
    aiotools.PersistentTaskGroup
    aiotools.as_completed_safe()
    ?
    ?
    Traceback & logging
    N/A

    View Slide

  7. Traceback in asyncio
    • Python traceback gives:
    – The type and arguments of the exception
    – The location where the exception is raised in the stack
    • Are these sufficient?
    – What if a 3rd-party code swallows the cancellation?
    – What if a task exits with an error but its garbage collection is
    delayed (e.g., due to extra references from long-lived objects)?
    File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2
    await asyncio.sleep(100)
    File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep
    return await future
    ^^^^^^^^^^^^
    asyncio.exceptions.CancelledError

    View Slide

  8. Traceback in asyncio
    • We also need to know:
    – Who cancelled this task?
    e.g., Where did my precious task go?!
    – Who created this task?
    e.g., When there are unexpectedly many number of tasks...
    File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2
    await asyncio.sleep(100)
    File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep
    return await future
    ^^^^^^^^^^^^
    asyncio.exceptions.CancelledError

    View Slide

  9. The world of async tasks
    My Code
    3rd Party Code
    My Code
    3rd Party Code
    My Code
    My Code
    task stack
    Error!
    Swallows
    cancellation?
    raise
    cancel()
    await
    ...
    ...
    My Code
    3rd Party Code
    create_task()
    await
    await
    An asyncio app is a giant tree
    of multiple tasks with their
    own stacks, interacting with
    each other.
    callback
    task stack

    View Slide

  10. aiomonitor-ng
    Improving debuggability of complex asyncio applications

    View Slide

  11. A brief architecture of aiomonitor
    asyncio Application A Separate Terminal
    App Thread
    aiomonitor
    Thread
    Telnet client
    asyncio
    event loop
    telnet
    ui loop
    REPL
    with start_monitor():

    View Slide

  12. Quick start
    import aiomonitor
    async def main():
    loop = asyncio.get_running_loop()
    run_forever = loop.create_future()
    with aiomonitor.start_monitor(loop):
    await run_forever
    try:
    asyncio.run(main())
    except KeyboardInterrupt:
    pass
    $ pip install aiomonitor-ng
    $ edit test.py
    $ python test.py &
    $ telnet localhost 50101

    View Slide

  13. Quick start
    monitor >>> help
    Usage: help [OPTIONS] COMMAND [ARGS]...
    To see the usage of each command, run them with "--help" option.
    Commands:
    cancel Cancel an indicated task
    console Switch to async Python REPL
    exit (q,quit) Leave the monitor client session
    help (?,h) Show the list of commands
    ps (p) Show task table
    ps-terminated (pst,pt) List recently terminated/cancelled tasks
    signal Send a Unix signal
    stacktrace (st,stack) Print a stack trace from the event loop thread
    where (w) Show stack frames and the task creation chain of a task
    where-terminated (wt) Show stack frames and the termination/cancellation chain of a task

    View Slide

  14. Additions in aiomonitor-ng
    • The task creation chain tracker
    • The task termination tracker
    • The task cancellation chain tracker
    • prompt_toolkit + Click integration for rich CLI

    View Slide

  15. Task creation chain tracker
    • Q: Who created this task and its parents?
    • Implementation
    – class TracedTrask(asyncio.Task)
    Thanks to asyncio.set_task_factory()
    – Store weakref of tasks in a bookkeeper
    – Traverse through the chain of weakrefs upon where command
    • Assumption
    – Parent tasks are alive while child tasks are alive. (Lifetime inclusion)
    Structured concurrency and asyncio.TaskGroup encourage this.
    It makes easier to prevent memory leak in the creation chain tracker.

    View Slide

  16. Task C
    Task B
    Task A
    await ...()
    ...()
    create_task(...)
    await ...()
    create_task(...)
    ...()
    await ...()
    await ...()
    create_task(...)
    Task D
    await ...()
    await ...()
    weak keys
    ...()
    Task A
    Task B
    Task C
    Task D
    _created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task])
    Task creation chain tracker
    _created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary])
    Task A
    Task B
    Task B
    stack A
    stack B
    stack C
    stack D
    weak
    values

    View Slide

  17. Task termination tracker
    • Q. Why is this task terminated (even it was a long before...)?
    • Implementation
    – Register _trace_termination() as the task done callback
    – get_trace_id(): Task object's ID is no longer valid after gc.
    Combine task ID and name (default: Task-XXX) to mangle a
    unique identifier to refer in the termination logs
    – Prevent memory leak by limiting the maximum length of history
    – Add @preserve_termination_logs decorator to mark specific
    coroutine task's termination log forever

    View Slide

  18. Task cancellation chain tracker
    • Q: Who cancelled this task, which also may be cancelled by another?
    • Implementation
    – Override Task.cancel() to store CancellationChain
    current_task() in cancel() is the canceller task,
    while self in cancel() is the cancellee task.
    We can extract the canceller's stack trace at this point.
    – Deligate postprocessing to the monitor UI thread which holds the
    tracker objects.
    Uses janus for sync/async queue interfacing.
    Removes tracker objects of cancelled/terminated tasks when
    necessary (e.g., reached the history limit)

    View Slide

  19. Task B
    Task X
    await ...()
    ...()
    task_b.cancel(...)
    await ...()
    _termination_info_queue: janus.Queue[TerminatedTaskInfo]
    Task termination & cancellation chain tracker
    _cancellation_chain_queue: janus.Queue[CancellationChain]
    Monitor UI thread
    App thread
    await ...()
    canceller cancellee
    cancel()
    extract_stack(task)
    filter_stack(task)
    cleanup_trackers()
    task done callback TracedTask.cancel()

    View Slide

  20. prompt_toolkit + Click integration
    • aiomonitor-ng
    – Multiple concurrent telnet sessions
    (resolving aio-libs/aiomonitor#140)
    – Auto-completion of commands and task IDs / filtering the task list
    • Implementation
    – Combines several contextvars to keep track of:
    The monitor instance
    The current stdout buffer linked with the current telnet
    connection
    – Signals Click command completion with asyncio.Event
    – Defines a custom ClickCompleter adapter

    View Slide

  21. Demo
    Improving debuggability of complex asyncio applications

    View Slide

  22. Live task list (abbrev. of long paths)

    View Slide

  23. Filtering the live task list
    by coro/task names
    List of terminated /
    cancelled tasks

    View Slide

  24. A part of task creation chain trace showing that a 3rd party
    library (grpcio) creates internal sub-tasks
    ← my code
    ← 3rd party code

    View Slide

  25. Case studies
    Improving debuggability of complex asyncio applications

    View Slide

  26. Case: Backend.AI “hangs"
    • Symptom: No error logs, but no longer user requests processed
    some time after startup.
    • Analysis: Agent's process_lifecycle_events() task to receive Docker's
    container status updates disappears...
    – Mis-used asyncio.TaskGroup to keep track of multiple fired tasks
    along a long-running loop, which caused termination of the entire
    group & loop upon any unhandled child exception
    This became the motivation behind
    aiotools.PersistentTaskGroup as well
    • aiomonitor-ng helped analysis with the task termination logs.

    View Slide

  27. Case: Backend.AI “hogs the CPU"
    • Symptom: Slow-down and eventual hang up. Happened in a specific
    customer site only.
    • Analysis: One user has put too many files in a storage folder (more
    than millions) which was periodically scanned to get the disk usage
    and file count recursively.
    – The latency of a scan task became longer than the interval.
    – The tasks were indefinitely accumulated (observed up to 6,000),
    competing for the same executor pool for synchronous I/O.
    • aiomonitor-ng helped analysis with the live task monitoring.

    View Slide

  28. Case: Hidden bomb inside a 3rd party dependency
    • Symptom: Memory leak after upgrading grpcio to a specific version
    • Analysis: There was an internal bug in grpcio when using the secure
    gRPC channel to acess etcd with explicit credentials, not cleaning
    up a specific callback task.
    – grpc/grpc#25449
    – There were too many dangling tasks, which were generated as an
    internal callback of grpcio.
    • aiomonitor helped analysis with the task creation chain tracker.

    View Slide

  29. Summary
    Improving debuggability of complex asyncio applications

    View Slide

  30. Lessons
    • Debuggability = Observability + Reproducibility
    – Live inspection & Logging
    • asyncio needs a unique tooling for debugging!
    • We should continuously invest at improving our toolchains for
    debugging.
    – There may seem no visible direct returns, but when bad things
    happen, it will save your time.

    View Slide

  31. Future Work
    • Scoped live task view
    – A structured async app = a tree of task groups
    – List of persistent task groups (or supervisors)
    – Per-taskgroup task list view
    • Top-like activity monitor (e.g., most/least recently yielded tasks)
    • Adopting subinterpreters for better reliability
    • Performance optimization / Better pagination
    – Especially when there are thousands of tasks...

    View Slide

  32. View Slide