High Performance FastAPI EN

PyCon JP 2021 presentation

Ikuo Suyama

October 15, 2021
Transcript

  1. High Performance
    FastAPI
    Ikuo Suyama

  2. 陶山 育男/Ikuo Suyama
    @martin_lover_se
    SmartNews, Inc.
    ✤ Ads Backend Engineer
    ✤ Internet ads specialist
    ✤ "Nothing beats ads, not even food "
    ✤ Usually use JVM, about one year of
    experience using Python
    ✤ Please go easy on me… 🙇

  3. Session Takeaways
    A detailed look at performance tuning for a
    Python web application:
    ✤ Profiling methods and identification of
    bottlenecks
    ✤ Specific challenges faced in our environment,
    and countermeasures

  4. 1. Introduction
    2. Load Testing and Profiling
    3. Problems Faced and Countermeasures
    4. Summary
    Agenda

  5. 1. Introduction
    2. Load Testing and Profiling
    3. Problems Faced and Countermeasures
    4. Summary
    Agenda

  6. About SmartNews
    No. 1 news app by user base*1:
    an everyday habit of consumers
    makes the largest user base in Japan
    Number of users: 20 million people per month
    (consolidated for Japan and the US, as of August 2019)
    Per person: used for about 16.7 minutes per day*2
    *1. Source: Nielsen Mobile NetView as of January 2021 (calculation of SmartNews App's
    user base based on number of installs of SmartNews App)
    *2. In-house figures, average for January 2021

  7. Python in SmartNews - Coupon Channel
    news-service
    coupon-service
    api-gateway
    coupon-admin
    Redis
    RDB
    Adopting FastAPI for web servers

  8. • Flask-like annotation-based routing API
    • Compatible with OpenAPI
    • A simple dependency-injection (DI) mechanism
    • Easy to learn, so development can begin quickly
    FastAPI is a modern, fast (high-performance), web framework for building APIs
    with Python 3.6+ based on standard Python type hints.
    - FastAPI
    Makes a prompt service launch possible!

  9. FastAPI - Seeing is believing
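    The deck shows a live demo here. As a stand-in, a minimal app of the kind the slides describe, with decorator routing and type hints driving validation and the OpenAPI schema (the Item model and routes are invented for illustration):

    ```python
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Item(BaseModel):
        name: str
        price: float

    # Flask-like decorator routing; the int type hint on item_id
    # gives automatic validation and shows up in the OpenAPI docs
    @app.get("/items/{item_id}")
    async def read_item(item_id: int):
        return {"item_id": item_id}

    # The pydantic model validates the request body
    @app.post("/items")
    async def create_item(item: Item):
        return {"name": item.name, "price": item.price}
    ```

    Running `uvicorn main:app` then serves the API, with interactive docs at `/docs`.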

  10. Full Rollout…
    news-service
    coupon-service
    api-gateway
    coupon-admin
    Redis
    RDB
    SmartNews power!! 🔥
    → Performance tuning needed

  11. 1. Introduction
    2. Load Testing and Profiling
    3. Problems Faced and Countermeasures
    4. Summary
    Agenda

  12. Performance Tuning Mindset
    1. Determine
    performance targets
    2. Identify measurements and
    bottlenecks
    3. Resolve bottlenecks
    Met performance targets?
    END 🎉
    YES
    NO

  13. "Performance" of Web Applications
    You can go on and on if there is no target!
    (Because it's so much fun!)
    Throughput
    How many requests can be handled in a given amount of time?
    Latency
    How much time does it take to handle one request?
    2-1. Determine performance targets

  14. Little's Law: L = λW
    L … the average number of items in a queuing system
    λ … the average number of items arriving at the system per
    unit of time
    W … the average time an item spends in the queuing
    system
    In terms of web application performance: L is the number of
    in-flight requests, λ is the throughput, and W is the latency.
    2-1. Determine performance targets

  15. Example: handling a 1,000 rps system with 50 cores
    Supposing the use of ten 5-core instances (or Pods):
    since each 5-core instance should handle 1000/10 = 100 rps,
    set the performance target per instance as:
    Throughput 100 rps (or more), Latency 50 ms (or less)
    Note: it may not always be possible to use the CPU at 100%; I/O and concurrency also need to be considered,
    and the number of CPU cores a K8s pod can use does not correspond exactly to its limit, so
    the real calculation is quite complex; this is just a rough estimate.
    Method of determining performance targets: example
    2-1. Determine performance targets
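    Little's Law from the previous slide makes this estimate mechanical. A quick sketch, using the slide's own target figures:

    ```python
    # Little's Law: L = lam * W
    # L   ... average number of requests in flight
    # lam ... arrival rate = throughput (requests per second)
    # W   ... average time a request spends in the system (seconds)

    lam = 100        # target throughput per 5-core instance (rps)
    W = 0.050        # target latency: 50 ms
    L = lam * W      # concurrent requests each instance must sustain (= 5)

    instances = 10
    fleet_rps = instances * lam   # 10 instances x 100 rps = 1000 rps
    ```

    Reading it the other way around: if an instance can only hold 5 requests in flight, latency must stay at or below 50 ms to reach 100 rps.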

  16. Rule 1: You can't tell where a program is going to spend its time.
    Bottlenecks occur in surprising places,
    so don't try to second guess and put in a speed hack until you've proven
    that's where the bottleneck is.
    Rule 2: Measure. Don't tune for speed until you've measured,
    and even then don't unless one part of the code overwhelms the rest.
    - Rob Pike: Notes on Programming in C
    - [Wikipedia] Pike: Notes on Programming in C (Japanese)
    2-2. Identify measurements and bottlenecks
    Do not guess, measure.
    Bottlenecks cannot be identified without measuring,
    and there is no point in tuning what is not a bottleneck.

  17. 2-2. Identify measurements and bottlenecks
    1. Determine the performance targets
    2. Identify measurements and bottlenecks
    • Prepare the environment for load testing
    • Add load and check the load situation of the whole system
    • Is the application the bottleneck?
    NO → correct until the bottleneck becomes part of the application
    • Check the application's load situation:
    recreate the same conditions locally as much as possible,
    and measure while adding load (speeds up the feedback loop)
    3. Resolve bottlenecks
    • Correct the part that is most likely to be the bottleneck
    • Is the bottleneck resolved? Has performance improved?
    YES → deploy in the load-testing environment and measure
    Met the performance targets?
    YES → END 🎉 / NO → measure again
    Tools for measurement: APM, fastapi_profiler, py-spy

  18. • A load-testing tool written in Python
    • Scenarios are written with a Python API
    • Simple but sufficient configuration
    • Can be run as a cluster to generate large-scale load
    Locust is an easy to use, scriptable and
    scalable performance testing tool.
    - LOCUST
    2-2. Identify measurements and bottlenecks
    Add load: LOCUST

  19. Add load: LOCUST
    Specify the number of
    simultaneous connections.
    Locust reports rps and
    median/95th-percentile latency per run.
    Add load while increasing the number of simultaneous connections
    until latency worsens, and check the maximum throughput.
    2-2. Identify measurements and bottlenecks

  20. • Check the resource situation of the whole system while load is added
    • Q: Has the application become the bottleneck?
    • There is no point in tuning the application until the application becomes
    the bottleneck.
    • Middleware, such as databases, usually becomes the bottleneck first.
    • Tune until the application throughput stops increasing even
    though resources, such as upstream services and connected middleware,
    still have capacity available.
    • Today, I will only be talking about tuning the application 🙇
    Check the load situation of the whole system:
    Datadog
    So, the rest of the presentation is based on the assumption that the
    application is the bottleneck.
    2-2. Identify measurements and bottlenecks

  21. Check the number of requests for
    each service
    and the usage of CPU/memory, etc.
    Check the CPU usage of
    middleware and
    the changes in the number of
    connections.
    Although detailed status can also be checked by attaching to the containers,
    this is very convenient for grasping the big picture.
    2-2. Identify measurements and bottlenecks

  22. • Easy to set up: a sidecar and a few lines of launch code
    • Gets data from the real operating environment
    • Helps find bottlenecks, as most of the necessary information is
    available
    • But it is expensive 💸, so workarounds are needed,
    such as applying it only to specific instances...
    Application measurement (1): Datadog APM
    Datadog APM and Continuous Profiler provide detailed visibility into the application with
    standard performance dashboards for web services, queues, and databases to
    monitor requests, errors, and latency.
    — Datadog APM
    2-2. Identify measurements and bottlenecks

  23. Application measurement (1): Datadog APM
    Latency histogram
    CPU time by scripts
    CPU time by function
    This is taking up the most time
    and is likely to be a bottleneck
    2-2. Identify measurements and bottlenecks

  24. Application measurement (2): fastapi_profiler
    A FastAPI middleware of pyinstrument to check your
    service code performance.
    — fastapi_profiler
    • Integrates pyinstrument as a FastAPI middleware
    • Useful for measuring CPU time while modifying code locally
    • Cannot be used in production or under load, because
    the impact on performance is high
    2-2. Identify measurements and bottlenecks

  25. Application measurement (2): fastapi_profiler
    add_middleware is all
    it takes to set it up.
    The sorting method is slow
    = possible bottleneck.
    2-2. Identify measurements and bottlenecks

  26. • A profiler written in Rust, with extremely low overhead
    • Useful for quickly measuring CPU in a local/load environment
    • Also useful when the GIL or multi-threaded processing could be the
    bottleneck
    • fastapi_profiler cannot measure threads other than the one it
    runs in; py-spy can
    Application measurement (3): py-spy
    It lets you visualize what your Python program is
    spending time on without restarting the program or
    modifying the code in any way.
    — py-spy
    2-2. Identify measurements and bottlenecks

  27. Application measurement (3): py-spy
    This method takes up most of
    the endpoint processing time.
    2-2. Identify measurements and bottlenecks

  28. 1. Introduction
    2. Load Testing and Profiling
    3. Problems Faced and Countermeasures
    4. Summary
    Agenda

  29. Configuration and load characteristics for
    deploying FastAPI
    Gunicorn supports working as a process manager and allowing users to tell it
    which specific worker process class to use. Then Gunicorn would start one or
    more worker processes using that class.
    And Uvicorn has a Gunicorn-compatible worker class.
    — Server Workers - Gunicorn with Uvicorn
    Gunicorn launches the process, then forks and hands the sockets
    to the workers.
    Each Uvicorn worker handles request/response with asynchronous
    processing (function calls and callbacks); the FastAPI application
    extending it holds the business logic.
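    The stack above is typically launched with a command along these lines (the module path `main:app`, worker count, and port are illustrative):

    ```shell
    # Gunicorn manages the worker processes; each worker is a
    # Uvicorn ASGI server running the FastAPI app
    gunicorn main:app \
      --workers 4 \
      --worker-class uvicorn.workers.UvicornWorker \
      --bind 0.0.0.0:8000
    ```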

  30. Configuration and load characteristics for
    deploying FastAPI
    • Gunicorn: no impact on performance, because it only deals with
    process management
    • Uvicorn: the fastest Python ASGI server
    • FastAPI: not faster than Starlette, which it extends, but...
    • The layers below Starlette did not pose a bottleneck this time

  31. 1. Slow I/O bound processing
    2. Slow CPU bound processing
    3. Slow logging: Slow MultiThread
    4. Improving overall performance by changing the
    Python implementation
    3. Problems Faced

  32. 3-1. Slow I/O bound processing
    ✤ Symptoms
    • Throughput does not increase even though the CPU is not fully used
    • Check CPU usage with top, vmstat, etc.
    ✤ Causes
    • Processing that requires network I/O
    • E.g.: access to a DB or other middleware
    • File writes
    • etc., etc.

  33. • Use async/await wherever the network is involved
    • Many compatible libraries
    • E.g.: aioredis/httpx …
    3-1. Slow I/O bound processing
    Countermeasure (1): Use asyncio
    asyncio is a library to write concurrent code
    using the async/await syntax.
    — asyncio

  34. 3-1. Slow I/O bound processing
    Countermeasure (1): Use asyncio
    I set this as the default for anything involving the network,
    so there were hardly any problems with I/O.
    On FastAPI, simply declare the
    routing method as async,
    and use await inside the async
    method
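    A minimal sketch of the pattern, with `asyncio.sleep` standing in for a real network call such as an aioredis GET (`fetch_coupon` and the 10 ms delay are invented for illustration):

    ```python
    import asyncio

    async def fetch_coupon(coupon_id):
        # Stand-in for a network round trip (e.g. a Redis GET)
        await asyncio.sleep(0.01)
        return {"id": coupon_id}

    async def handler(coupon_ids):
        # Await the calls concurrently: total wait is roughly one
        # round trip, not one round trip per coupon
        return list(await asyncio.gather(*(fetch_coupon(c) for c in coupon_ids)))

    results = asyncio.run(handler([1, 2, 3]))
    ```

    In a FastAPI routing method declared `async def`, the same `await`/`gather` calls run on the server's event loop.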

  35. • Still slow wherever the network was involved
    • A network round trip can take thousands of times longer
    than a main-memory reference
    • Cache retrieved results in application memory if some staleness
    in the results can be tolerated
    3-1. Slow I/O bound processing
    Countermeasure (2): Cache
    Latency Comparison Numbers (~2012)
    ----------------------------------
    :
    Main memory reference 100 ns
    :
    Round trip within same datacenter 500,000 ns 500 us
    — Latency Numbers Every Programmer Should Know

  36. 3-1. Slow I/O bound processing
    Countermeasure (2): Cache
    A simple LRU cache using
    OrderedDict
    Access Redis only when
    the cache misses
    Uses a cache per method;
    the lru_cache decorator is also useful
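    A minimal sketch of such an LRU cache on OrderedDict (not the production code from the talk; the maxsize and keys are illustrative):

    ```python
    from collections import OrderedDict

    class LRUCache:
        """Simple LRU cache: OrderedDict keeps insertion order, and
        move_to_end/popitem give us recency tracking and eviction."""

        def __init__(self, maxsize=128):
            self.maxsize = maxsize
            self._data = OrderedDict()

        def get(self, key, default=None):
            if key not in self._data:
                return default          # cache miss: caller goes to Redis
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

        def put(self, key, value):
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.maxsize:
                self._data.popitem(last=False)  # evict least recently used

    cache = LRUCache(maxsize=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")       # "a" becomes most recently used
    cache.put("c", 3)    # over capacity: evicts "b"
    ```

    For pure functions, `functools.lru_cache` gives the same behavior with a single decorator.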

  37. ✤ Symptoms
    • High user time in top, vmstat, etc.; high load average
    • CPU throttling in a k8s environment
    ✤ Main causes
    • Simply heavy computation, such as floating-point calculations
    • Inefficient algorithms
    • etc., etc.
    3-2. Slow CPU bound processing
    Now, let's focus on application tuning!

  38. • E.g.: URL encoding (urllib/parse.py) is slow
    Countermeasure (1): Use precomputation/caching
    Either calculate offline in advance,
    or bypass heavy calculations by caching their results.
    urllib/parse.py
    takes up a lot of time
    → cache the URL calculation
    results
    3-2. Slow CPU bound processing
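    One way to sketch this countermeasure with the standard library (`cached_quote` and the `safe` set are illustrative, not the talk's actual code):

    ```python
    from functools import lru_cache
    from urllib.parse import quote

    @lru_cache(maxsize=4096)
    def cached_quote(url: str) -> str:
        # urllib.parse is pure Python and shows up hot in profiles;
        # identical URLs recur across requests, so memoize the result
        return quote(url, safe=":/?&=")

    first = cached_quote("https://example.com/a b?q=1")
    second = cached_quote("https://example.com/a b?q=1")  # served from the cache
    ```

    The same idea applies to any deterministic per-request computation: trade a dict lookup for the repeated CPU work.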

  39. • For computations that must be done online for each
    request and that have a high computational load
    • E.g.: calculating a geohash from the requested location information
    • Pre-compiles to a lower-level language (C)
    • Can be applied partially, e.g. per function
    Countermeasure (2): Cython
    3-2. Slow CPU bound processing
    Cython is an optimising static compiler for both the Python
    programming language and the extended Cython programming
    language
    — Cython

  40. Countermeasure (2): Cython
    3-2. Slow CPU bound processing
    The function that calculates the
    geohash (map tile) from the given
    location information is moved into a .pyx file
    and compiled; the caller stays the same as usual
    Python.
    Prepare a setup.py like
    this; it compiles during docker
    build.
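    A build-script sketch of that setup (assumes Cython is installed; `geohash.pyx` and the package name are illustrative, not the talk's actual files):

    ```python
    # setup.py -- compiles the hot .pyx module to a C extension
    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        name="geohash-ext",
        ext_modules=cythonize("geohash.pyx", language_level=3),
    )
    ```

    During `docker build`, running `python setup.py build_ext --inplace` produces the compiled module, and callers keep doing a plain `import geohash`.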

  41. • Loading from the CPU cache is much faster than from RAM
    • Optimize CPU-cache usage by replacing Python loops with
    numpy vector operations
    • E.g.: calculating the distance from the requested location
    to each coupon
    Countermeasure (3): numpy
    3-2. Slow CPU bound processing
    Latency Comparison Numbers (~2012)
    ----------------------------------
    L1 cache reference 0.5 ns
    L2 cache reference 7 ns 14x L1 cache
    :
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    — Latency Numbers Every Programmer Should Know

  42. Countermeasure (3): numpy
    3-2. Slow CPU bound processing
    We want to calculate online, for each request,
    the distance from the request's location
    to each coupon.
    Turn the coupon coordinates into vectors
    and use vector calculation.
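    A sketch of the vectorization idea with toy planar coordinates (the talk's real code computes geographic distance; the names and values here are illustrative):

    ```python
    import numpy as np

    # Coupon locations as one (N, 2) array instead of a list of
    # objects; one vectorized pass replaces a per-coupon Python loop
    coupon_xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

    def distances_from(user_xy):
        diff = coupon_xy - user_xy          # broadcasting: (N, 2) - (2,)
        return np.sqrt((diff ** 2).sum(axis=1))

    d = distances_from(np.array([0.0, 0.0]))
    ```

    The contiguous array layout is also what makes this CPU-cache friendly: the whole computation streams over one block of memory.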
     

  43. 3-3. Slow logging: Slow MultiThread
    ✤ Symptoms
    • CPU is used by threads other than FastAPI's
    • Check with Datadog
    • The GIL% shown by py-spy top is low
    ✤ Causes
    • A library called loguru was writing logs in a separate thread, and
    that was taking a long time
    • Python can handle threads, but not efficiently, because of
    the GIL

  44. 3-3. Slow logging: Slow MultiThread
    GIL / Global Interpreter Lock
    In CPython, the global interpreter lock, or GIL, is a mutex that protects
    access to Python objects, preventing multiple threads from executing
    Python bytecodes at once.
    — Global Interpreter Lock
    • Even if you set up threads, only one thread runs at a time
    within a process
    • In Python, parallelism is achieved with multi-processing;
    combining that with concurrent processing via asyncio can solve
    the problem.
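    A sketch of that multi-processing + asyncio combination (the function names and workload size are illustrative):

    ```python
    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def cpu_heavy(n):
        # Pure-Python CPU-bound work: a second *thread* running this
        # would serialize on the GIL; a second *process* truly runs
        # in parallel on another core
        return sum(i * i for i in range(n))

    async def handle(pool, n):
        loop = asyncio.get_running_loop()
        # Ship the CPU work to a worker process so the event loop
        # stays free to serve other requests
        return await loop.run_in_executor(pool, cpu_heavy, n)

    def demo():
        with ProcessPoolExecutor(max_workers=2) as pool:
            return asyncio.run(handle(pool, 10_000))
    ```

    `run_in_executor` with a `ProcessPoolExecutor` is the standard way to mix CPU-bound work into an asyncio server without blocking it.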

  45. 3-3. Slow logging: Slow MultiThread
    Serialization and emitting of
    log records take a
    long time.
    It seems to be retrieving records
    from the Queue to
    deserialize them.
    If _enqueue is true (default),
    a thread with a SimpleQueue is created
    for log output.
    multiprocessing.SimpleQueue
    is actually a pipe for communication between
    processes.
    Data transfer to and from the WriterThread seems to be slow.

  46. 3-3. Slow logging: Slow MultiThread
    Countermeasure: Suppress logging and change
    libraries
    • Switched to fastlogging
    • I would like to handle relatively large logs, but this is still
    unresolved.

  47. • There are various Python implementations other than the CPython
    we usually use.
    • Implementations that compile to a lower-level language, implementations
    with a JIT compiler, etc...
    • I tried two and selected PyPy this time:
    • PyPy
    • cinder
    • If the CPU is the bottleneck, an overall performance improvement can
    be expected.
    3-4. Improving overall performance by changing
    the Python implementation

  48. cinder
    Cinder is Instagram's internal performance-oriented production version of
    CPython 3.8.
    It contains a number of performance optimizations, including bytecode inline
    caching, eager evaluation of coroutines, …
    — cinder
    • A performance-improved CPython from Facebook
    • We saw a performance improvement of about +10% in our environment.
    • It is highly compatible with CPython and most of its libraries.
    • However, you need to build it yourself; I gave up on it because the
    GitHub repository is not very active and there was no documentation,
    so it was difficult to manage.
    3-4. Improving overall performance by changing
    the Python implementation

  49. PyPy
    A fast, compliant alternative implementation of Python
    On average, PyPy is 4.2 times faster(!) than CPython
    — PyPy
    • JIT compiler / incminimark GC
    • It is quite compatible with CPython.
    • There have been no compatibility-related problems so far.
    • The performance benefits were so great that we now use it in
    production.
    • We confirmed an improvement in performance of nearly 40%.
    3-4. Improving overall performance by changing
    the Python implementation

  50. PyPy - This is not a silver bullet
    ✤ Challenges faced in replacement and operation
    • The latest version is 3.7.
    • Some libraries are not available.
    • OOM death due to omitting the GC option specification
    • Memory leak when combined with FastAPI?
    3-4. Improving overall performance by changing
    the Python implementation

  51. Problems in PyPy (1): The latest version is 3.7
    ✤ Problem: 3.7 is the latest version as of 2021
    • 3.8 is in beta
    ✤ Solution
    • Since I was developing based on version 3.8, I downgraded
    some parts:
    • The walrus operator :=
    • Positional-only arguments: def huga(hoge, /, …)
    • I have not faced any real problems.
    3-4. Improving overall performance by changing
    the Python implementation

  52. Problems in PyPy (2): Some libraries cannot be used
    ✤ Problem: libraries that involve building external sources (PEP 517)
    are almost always a problem.
    • E.g.:
    • orjson … high-speed JSON serializer written in Rust 😭
    • dd-trace-py … for measuring Datadog APM 😭
    • fastapi-profiler … the profiler that had been working wonders 😭
    • black (because of typed-ast) … linter/formatter, used through pysen
    😭
    ✤ Solution
    • Use separate implementations for the development and deployment
    environments
    3-4. Improving overall performance by changing
    the Python implementation

  53. Problems in PyPy (2): Some libraries cannot be used
    For development: CPython 3.7 (we also run unit tests on CI)
    For production: PyPy 3.7
    To prevent bugs where the program works in CPython but not in
    PyPy, we run e2e tests using production containers during CD.
    3-4. Improving overall performance by changing
    the Python implementation

  54. Problems in PyPy (3): OOM death due to omitting the GC
    option specification
    ✤ Problem: memory grows to the maximum and the process is killed
    by the OOM killer.
    • The GC for PyPy is incminimark
    • GC options must be specified
    ✤ Solution
    • Output GC debug logs with PYPY_GC_DEBUG=2
    • Restrict the heap limit with PYPY_GC_MAX
    3-4. Improving overall performance by changing
    the Python implementation
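    A deployment sketch of those GC settings (PYPY_GC_MAX and PYPY_GC_DEBUG are real PyPy environment variables; the 800MB value and worker count are illustrative):

    ```shell
    # Cap the PyPy heap so that (number of workers) x PYPY_GC_MAX
    # stays under the container's memory limit
    export PYPY_GC_MAX=800MB       # hard heap limit per worker process
    export PYPY_GC_DEBUG=2         # verbose GC logging while investigating
    gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
    ```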

  55. Problems in PyPy (3): OOM death due to omitting the GC
    option specification
    At least specify PYPY_GC_MAX,
    and ensure MaxWorker × PYPY_GC_MAX fits within the
    instance's memory.
    3-4. Improving overall performance by changing
    the Python implementation

  56. Problems in PyPy (4): Memory leak when combined with
    FastAPI?
    ✤ Problem: memory grows to the maximum and the process is killed
    by the OOM killer.
    • Even on an echo server, memory increases monotonically
    under load
    • I had not set keepalive on nginx...
    ✤ Solution
    • Resolved by connecting with HTTP/1.1 (keep-alive by default)
    • Also set timeout-keep-alive on uvicorn
    • I have not dug deep into this, but it seems to be a
    connection problem, so it may be due to uvicorn rather than
    FastAPI.
    3-4. Improving overall performance by changing
    the Python implementation
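    A sketch of the keep-alive fix on the nginx side (the directives are real nginx ones; the addresses and pool size are illustrative):

    ```nginx
    # Reuse connections to the app server instead of opening a new
    # one per request
    upstream app {
        server 127.0.0.1:8000;
        keepalive 32;               # pool of idle keep-alive connections
    }
    server {
        listen 80;
        location / {
            proxy_pass http://app;
            proxy_http_version 1.1;           # keep-alive needs HTTP/1.1
            proxy_set_header Connection "";   # drop the default "close"
        }
    }
    ```

    On the uvicorn side, the corresponding setting is `--timeout-keep-alive` (seconds before an idle keep-alive connection is closed).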

  57. 3. Problems Faced
    Performance improvement per instance after all the countermeasures:
    [Throughput]
    50rps → 250rps (5x ⬆)
    [95%tile Latency]
    300ms → 30ms (10x ⬆)

  58. 1. Introduction
    2. Load Testing and Profiling
    3. Problems Faced and Countermeasures
    4. Summary
    Agenda

  59. Summary
    ✤ Load Testing and Profiling
    • Determine targets → Measure → Resolve bottlenecks
    • Useful tools: locust, datadog, fastapi-profiler, py-spy
    ✤ Problems and solutions
    • I/O bound: asyncio, cache
    • CPU bound: cache, Cython, numpy
    • Changing the Python implementation: PyPy

  60. Thank You for Your
    Kind Attention!
    twitter: @martin_lover_se
