High Performance FastAPI EN

PyCon JP 2021 presentation

Ikuo Suyama

October 15, 2021

Transcript

  1. 陶山 育男/Ikuo Suyama @martin_lover_se SmartNews, Inc.

    ✤ Ads Backend Engineer
    ✤ Internet ads specialist
    ✤ "Nothing beats ads, not even food"
    ✤ Usually work on the JVM; about one year of experience with Python
    ✤ Please go easy on me… 🙇
  2. Session Takeaways: Detailed methods of performance tuning for Python web applications

    ✤ Profiling methods and identification of bottlenecks
    ✤ Specific challenges faced in our environment and countermeasures
  3. About SmartNews

    Number of users: 20 million people per month
    Used for about 16.7 minutes*2 per day per person
    (Consolidated for Japan and US, as of August 2019)
    No. 1 News App User Base*1 - An everyday habit of consumers makes the largest user base in Japan
    *1. Source: Nielsen Mobile NetView as of January 2021 (calculation of SmartNews App's user base based on number of installs of SmartNews App)
    *2. In-house figures, average for January 2021
  4. FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. - FastAPI

    • Flask-like decorator-based routing API
    • Compatible with OpenAPI
    • Simple DI function
    • Easy to learn, development can begin quickly
    Makes prompt service launch possible!
  5. Performance Tuning Mindset

    1. Determine performance targets
    2. Identify measurements and bottlenecks
    3. Resolve bottlenecks
    Met performance targets? YES: END 🎉 / NO: continue the cycle
  6. "Performance" of Web Applications

    Throughput: How many requests can be handled per unit of time?
    Latency: How much time does it take to handle one request?
    You can go on and on if there is no target! (Because it's so much fun!)
    2-1. Determine performance targets
  7. Little's theorem

    L … the average number of items in a queuing system
    λ … the average number of items arriving at the system per unit of time
    W … the average waiting time an item spends in a queuing system
    In terms of web application performance, this means...
    2-1. Determine performance targets
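
Little's theorem ties together the three quantities defined on this slide. Reading L as the number of requests in flight, λ as throughput, and W as latency is an interpretation added here for illustration, not text from the slide:

```latex
L = \lambda W
\qquad\Longleftrightarrow\qquad
W = \frac{L}{\lambda}
```

Under that reading, fixing the achievable concurrency L and the required throughput λ gives an upper bound on the acceptable latency W.
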
  8. Method of determining performance targets: Example

    Example: handling 1,000 rps with 50 cores
    Suppose we use ten 5-core instances (or Pods).
    Each 5-core instance then has to handle 1000/10 = 100 rps, so set the performance target per instance as: Throughput 100 rps (or more), Latency 50 ms (or less)
    Note: It is not always possible to use 100% of the CPU, I/O and concurrency also need to be considered, and the number of CPU cores usable in a K8s pod does not correspond exactly to the limit. The real calculation is quite complex, so this is just a rough estimate.
    2-1. Determine performance targets
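
The arithmetic behind these targets can be sketched as follows. This is a rough estimate only, assuming one request occupies one core at a time so that concurrency equals the core count; the function name and that assumption are illustrative and not from the slides.

```python
def per_instance_targets(total_rps, instances, cores_per_instance):
    """Rough per-instance targets derived from Little's theorem (L = lambda * W)."""
    # Each instance takes an equal share of the total traffic.
    throughput_rps = total_rps / instances
    # Assumption: L (requests in flight) equals the number of cores,
    # so the latency budget is W = L / lambda.
    latency_s = cores_per_instance / throughput_rps
    return throughput_rps, latency_s


throughput, latency = per_instance_targets(total_rps=1000, instances=10, cores_per_instance=5)
print(f"throughput >= {throughput:.0f} rps, latency <= {latency * 1000:.0f} ms")
```

With the slide's numbers this reproduces the 100 rps and 50 ms targets.
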
  9. Do not guess, measure.

    Rule 1: You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
    Rule 2: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
    - Rob Pike: Notes on Programming in C - [Wikipedia] Pike: Notes on Programming in C (Japanese)
    Bottlenecks cannot be identified without measuring, and there is no point in tuning what is not a bottleneck.
    2-2. Identify measurements and bottlenecks
  10. 1. Determine the performance targets 2. Identify measurements and bottlenecks 3. Resolve bottlenecks. Met performance targets? YES: END 🎉 / NO

    Flowchart steps:
    • Prepare the environment for load testing
    • Add load and check the load situation of the whole system
    • Is the application the bottleneck? (YES/NO) Correct until the bottleneck becomes part of the application
    • Check the application's load situation
    • Recreate the same conditions locally as much as possible, and measure while adding load
    • Correct the part that is most likely to be the bottleneck
    • Is the bottleneck resolved? (YES/NO)
    • Deploy in the load testing environment and measure
    • Has performance improved? (YES/NO)
    Tools for measurement: APM, fastapi_profiler, py-spy
    Speed up the feedback loop
    2-2. Identify measurements and bottlenecks
  11. Add load: LOCUST

    Locust is an easy to use, scriptable and scalable performance testing tool. - LOCUST
    • A load testing tool written in Python
    • Scenarios are written with a Python API
    • Simple and adequate settings
    • Can be run in clusters for large-scale loads
    2-2. Identify measurements and bottlenecks
  12. Add load: LOCUST

    Screenshot labels: specify the number of simultaneous connections; run clusters; rps, median/95%tile latency
    Add load until latency worsens while increasing the number of simultaneous connections, and check the maximum throughput
    2-2. Identify measurements and bottlenecks
  13. Check the load situation of the whole system: Datadog

    • Check the resource situation of the whole system while load is added
    • Q: Has the application become the bottleneck?
    • There is no point in tuning the application until the application becomes the bottleneck.
    • Middleware, such as a database, usually becomes the bottleneck first.
    • Keep tuning until the application throughput stops increasing even though other resources, such as upstream services and connected middleware, still have capacity available.
    • Today, I will only be talking about tuning the application. 🙇
    So, the rest of the presentation is based on the assumption that the application is the bottleneck.
    2-2. Identify measurements and bottlenecks
  14. Check the number of requests for each service and the

    usage of CPU/memory, etc. Check the CPU usage of middleware and the changes in the number of connections Although detailed status can also be checked by attaching data to containers, this is very convenient for grasping the big picture 2-2. Identify measurements and bottlenecks
  15. Application measurement (1): Datadog APM

    Datadog APM and Continuous Profiler provide detailed visibility into the application with standard performance dashboards for web services, queues, and databases to monitor requests, errors, and latency. - Datadog APM
    • Easy to set up: a sidecar and a few lines of launch code
    • Gets data from the real operating environment
    • Helps find bottlenecks as most of the necessary information is available
    • But, it is expensive 💸, and there is a need to find workarounds, such as applying it only to specific instances...
    2-2. Identify measurements and bottlenecks
  16. Application measurement (1): Datadog APM

    Screenshot labels: Latency histogram; CPU time by scripts; CPU time by function
    This takes the most time and is likely to be a bottleneck
    2-2. Identify measurements and bottlenecks
  17. Application measurement (2): fastapi_profiler

    A FastAPI middleware of pyinstrument to check your service code performance. - fastapi_profiler
    • Integrates pyinstrument as a FastAPI middleware
    • Useful for measuring CPU time while modifying code locally
    • Cannot be used in production or under load because its impact on performance is high
    2-2. Identify measurements and bottlenecks
  18. Application measurement (2): fastapi_profiler add_middleware resolves the problem The sorting

    method is slow = possible bottleneck 2-2. Identify measurements and bottlenecks
  19. Application measurement (3): py-spy

    It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. - py-spy
    • Profiler written in Rust, extremely low overhead
    • Useful for quickly measuring CPU in a local/load environment
    • Also useful when GIL or multi-threaded processing could be the bottleneck
    • fastapi_profiler cannot measure threads that are not under
    2-2. Identify measurements and bottlenecks
  20. This method takes up most of the endpoint processing time

    Application measurement (3): py-spy 2-2. Identify measurements and bottlenecks
  21. Configuration and load characteristics for deploying FastAPI

    Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class. - Server Workers - Gunicorn with Uvicorn
    Diagram layers: <Process Manager>, <ASGI Server>, <ASGI Framework>, <Web Framework>, <Application>
    Diagram labels: Launch the process and move forks and sockets to workers; Worker; Request/response; Asynchronous processing; Request; Response; Func call; callback; Extends; Applications; Business logic
  22. Configuration and load characteristics for deploying FastAPI

    Diagram layers: <Process Manager>, <ASGI Server>, <ASGI Framework>, <Web Framework>, <Application>
    Diagram labels: Worker; Request; Response; Func call; callback; Extends
    Annotations:
    • There is no impact on performance because it only deals with process management
    • It is not faster than Starlette, but...
    • Fastest Python ASGI
    • Items below Starlette did not pose a bottleneck this time
    Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class. - Server Workers - Gunicorn with Uvicorn
  23. 3. Problems Faced

    1. Slow I/O bound processing
    2. Slow CPU bound processing
    3. Slow logging: Slow MultiThread
    4. Improving overall performance through changes in the processing system
  24. 3-1. Slow I/O bound processing

    ✤ Symptoms
    • Throughput does not increase even though the CPU is not used up.
    • Check CPU us (user time) with top, vmstat, etc.
    ✤ Causes
    • Processing that requires network I/O
    • E.g.: access to a DB or other middleware
    • File writing
    • etc., etc.
  25. 3-1. Slow I/O bound processing Countermeasure (1): Use asyncio

    asyncio is a library to write concurrent code using the async/await syntax. - asyncio
    • Use async/await wherever the network is involved
    • Many compatible libraries
    • E.g.: aioredis/httpx …
  26. 3-1. Slow I/O bound processing Countermeasure (1): Use asyncio

    I set this as the default for anything involving the network, so there were hardly any problems with I/O
    On FastAPI, simply declare the routing method as async
    Use await inside the async method
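
A minimal sketch of why async/await helps with I/O bound work, using only the standard library. `asyncio.sleep` stands in for a network call; the function names are hypothetical and this is not the code from the slide. In a FastAPI application the same pattern would live inside a route function declared with `async def`.

```python
import asyncio
import time


async def fetch(name, delay_s):
    # Stand-in for a network call (DB, Redis, HTTP). While this coroutine
    # waits, the event loop is free to run other coroutines.
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def handler():
    # The three waits overlap instead of running back to back.
    return await asyncio.gather(
        fetch("db", 0.1),
        fetch("cache", 0.1),
        fetch("api", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(handler())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the three calls would take about 0.3 s; awaited together they finish in roughly the time of the slowest one.
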
  27. 3-1. Slow I/O bound processing Countermeasure (2): Cache

    • Still slow wherever the network was involved
    • A network round trip takes on the order of 1,000 times longer than a main memory reference (about 5,000 times by the numbers below)
    • Cache retrieved results in application memory if some inconsistency in results can be tolerated
    Latency Comparison Numbers (~2012)
    ----------------------------------
    :
    Main memory reference 100 ns
    :
    Round trip within same datacenter 500,000 ns 500 us
    - Latency Numbers Every Programmer Should Know
  28. 3-1. Slow I/O bound processing Countermeasure (2): Cache

    Simple LRUCache using OrderedDict
    Access Redis only when there is no cached value
    Use the cache in each method
    The lru_cache decorator is also useful
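
A minimal sketch of an LRU cache built on `OrderedDict`, in the spirit of what the slide describes. The class, `load_from_backend`, and `get_value` are hypothetical stand-ins (the backend function plays the role of the Redis lookup); the actual code on the slide is not in the transcript.

```python
from collections import OrderedDict


class LRUCache:
    """Small in-process LRU cache; the least recently used entry is evicted first."""

    def __init__(self, max_size=128):
        self._max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # drop the least recently used entry


calls = []


def load_from_backend(key):
    # Hypothetical stand-in for the slow lookup (e.g. a Redis access).
    calls.append(key)
    return f"value-for-{key}"


cache = LRUCache(max_size=2)


def get_value(key):
    value = cache.get(key)
    if value is None:  # cache miss: go to the backend only now
        value = load_from_backend(key)
        cache.put(key, value)
    return value


for k in ["a", "b", "a", "c", "b"]:
    get_value(k)
print(calls)
```

This sketch treats `None` as a miss, so it cannot cache `None` values; a sentinel object would lift that restriction.
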
  29. 3-2. Slow CPU bound processing

    Now, let's focus on application tuning!
    ✤ Symptoms
    • High user time shown in top, vmstat, etc.; high load average (LA)
    • CPU throttling in a k8s environment
    ✤ Main causes
    • Simple heavy calculations such as floating point calculations
    • Inefficient algorithms
    • etc., etc.
  30. 3-2. Slow CPU bound processing Countermeasure (1): Use precomputation/cache

    Either calculate offline in advance, or bypass heavy calculations by caching calculation results
    • E.g.: URL encoding processing (urllib/parse.py) is slow
    urllib/parse.py takes up a lot of time
    Cache URL calculation results
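
One way to cache URL-encoding results is the standard `functools.lru_cache` decorator around `urllib.parse.quote`. This is a sketch under the assumption that the same strings are encoded repeatedly across requests; the wrapper name and cache size are illustrative, not taken from the slide.

```python
from functools import lru_cache
from urllib.parse import quote


@lru_cache(maxsize=10_000)
def encode_url_component(value: str) -> str:
    # Percent-encode everything that is not an unreserved character.
    # Repeated inputs are served from the cache instead of being re-encoded.
    return quote(value, safe="")


first = encode_url_component("a b/c")
second = encode_url_component("a b/c")  # served from the cache
info = encode_url_component.cache_info()
print(first, info.hits, info.misses)
```

Caching only pays off when inputs repeat often; for mostly unique inputs the cache adds memory use without saving CPU time.
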
  31. 3-2. Slow CPU bound processing Countermeasure (2): Cython

    Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language - Cython
    • A measure for items that must be calculated online for each request and have a high computational load
    • E.g.: calculating geohash for requested location information
    • Pre-compile to a low-level language such as C
    • Can be applied partially, such as per function
  32. Countermeasure (2): Cython 3-2. Slow CPU bound processing

    Formula to calculate geohash (maptile) from specified location information
    Write it in a .pyx file and compile it; the caller is the same as in ordinary Python
    Prepare setup.py like this
    Compile during docker build
  33. 3-2. Slow CPU bound processing Countermeasure (3): numpy

    • Loading from CPU cache is much faster than from RAM
    • Optimize the use of CPU cache by using numpy vector calculation for loop processing
    • E.g.: calculating the distance to each coupon from the requested location information.
    Latency Comparison Numbers (~2012)
    ----------------------------------
    L1 cache reference 0.5 ns
    L2 cache reference 7 ns (14x L1 cache)
    :
    Main memory reference 100 ns (20x L2 cache, 200x L1 cache)
    - Latency Numbers Every Programmer Should Know
  34. Countermeasure (3): numpy 3-2. Slow CPU bound processing

    We want to calculate online the distance between the request's location and each coupon
    Turn the data into vectors and use vector calculation for later requests
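
A sketch of the idea, assuming the distance is a great-circle (haversine) distance and using made-up coordinates; the slide's actual formula and data are not in the transcript. The coupon coordinates are converted to numpy arrays once, and each request then computes all distances in one vectorized expression instead of a Python loop.

```python
import math

import numpy as np

EARTH_RADIUS_KM = 6371.0


def distances_loop(lat, lon, coupons):
    """Haversine distance (km) to each coupon, one at a time in pure Python."""
    lat1, lon1 = math.radians(lat), math.radians(lon)
    out = []
    for c_lat, c_lon in coupons:
        lat2, lon2 = math.radians(c_lat), math.radians(c_lon)
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        out.append(2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)))
    return out


def distances_numpy(lat, lon, coupon_lats_rad, coupon_lons_rad):
    """Same formula, evaluated for all coupons at once with numpy arrays."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    a = (np.sin((coupon_lats_rad - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(coupon_lats_rad)
         * np.sin((coupon_lons_rad - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical coupon locations as (latitude, longitude) in degrees.
coupons = [(35.658, 139.702), (34.702, 135.495), (35.681, 139.767)]

# Done once, ahead of the requests: convert to radian arrays.
coupon_array = np.radians(np.array(coupons))
coupon_lats_rad, coupon_lons_rad = coupon_array[:, 0], coupon_array[:, 1]

# Done per request.
loop_km = distances_loop(35.681, 139.767, coupons)
numpy_km = distances_numpy(35.681, 139.767, coupon_lats_rad, coupon_lons_rad)
print(np.round(numpy_km, 1))
```

The benefit grows with the number of coupons; for a handful of points the loop and the vectorized version cost about the same.
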
  35. 3-3. Slow logging: Slow MultiThread

    ✤ Symptoms
    • CPU is used by threads other than FastAPI
    • Check with Datadog
    • GIL values are low in py-spy top
    ✤ Causes
    • A library called loguru was writing logs using multiple threads, and that was taking a long time
    • Python can handle threads, but it is not efficient because of the GIL
  36. 3-3. Slow logging: Slow MultiThread

    GIL / Global Interpreter Lock
    In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. - Global Interpreter Lock
    • Even if you set up threads, only one thread runs at a time in the process
    • In Python, parallel processing is handled by multiprocessing. Combining that with concurrent processing using asyncio can solve the problem.
  37. 3-3. Slow logging: Slow MultiThread

    Serialization and emit of log information are taking a long time
    Seems to be retrieving from Queue to deserialize
    If _enqueue is true (default), create a thread for SimpleQueue and log output
    multiprocessing.SimpleQueue is actually a pipe for communication between processes
    Data transfer to and from WriterThread seems to be slow
  38. 3-3. Slow logging: Slow MultiThread Countermeasure: Suppress logging and change

    libraries • Switched to fastlogging • I would like to handle relatively large logs, but this is still unresolved.
  39. 3-4. Improving overall performance through changes in the processing system

    • There are various processing systems (Python implementations) other than the CPython we usually use.
    • Implementations that compile to a lower-level language, implementations with a JIT compiler, etc...
    • I tried two and selected PyPy this time.
    • PyPy
    • cinder
    • If the CPU is the bottleneck, an overall performance improvement can be expected.
  40. cinder

    Cinder is Instagram's internal performance-oriented production version of CPython 3.8. It contains a number of performance optimizations, including bytecode inline caching, eager evaluation of coroutines, … - cinder
    • Performance-improved CPython from Facebook
    • We saw a performance improvement of about +10% in our environment.
    • It is highly compatible with CPython and most of its libraries
    • However, you need to build it yourself, and I gave up on it because the GitHub repository is not very active and there was no documentation, so it was difficult to manage.
    3-4. Improving overall performance through changes in the processing system
  41. PyPy

    A fast, compliant alternative implementation of Python
    On average, PyPy is 4.2 times faster(!) than CPython - PyPy
    • JIT compiler/Incminimark GC
    • It is also quite compatible with CPython.
    • There have been no compatibility-related problems so far.
    • The performance benefits were so great that we now use it in production.
    • We confirmed a performance improvement of nearly 40%
    3-4. Improving overall performance through changes in the processing system
  42. PyPy - This is not a silver bullet ✤ Challenges

    faced in replacement and operation • The latest version is 3.7. • Some libraries are not available. • OOM death due to omission of GC Option specification • Memory leak when combined with FastAPI? 3-4. Improving overall performance through changes in the processing system
  43. Problems in PyPy (1): The latest version is 3.7

    ✤ Problem: 3.7 is the latest version as of 2021
    • 3.8 is in beta
    ✤ Solution
    • Since I was developing based on version 3.8, I downgraded some parts
    • Walrus operator :=
    • Positional-only parameters def huga(hoge, /, …)
    • I have not faced any real problems.
    3-4. Improving overall performance through changes in the processing system
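
The two downgrades can be illustrated as follows. This is a sketch with hypothetical functions (`leading_number` and the body of `huga` are made up for illustration); the 3.8 forms appear only in comments so the block also runs on 3.7.

```python
import re

# Python 3.8+ (walrus operator):
#     if (m := re.match(r"(\d+)", text)) is not None:
#         return int(m.group(1))
# Python 3.7-compatible form: assign first, then test.
def leading_number(text):
    m = re.match(r"(\d+)", text)
    if m is not None:
        return int(m.group(1))
    return None


# Python 3.8+ (positional-only parameter):
#     def huga(hoge, /, fuga=0): ...
# Python 3.7-compatible form: drop the "/" marker. The only behavioural
# difference is that callers could now also pass hoge by keyword.
def huga(hoge, fuga=0):
    return hoge + fuga


print(leading_number("42abc"), leading_number("abc"), huga(1, fuga=2))
```
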
  44. Problems in PyPy (2): Some libraries cannot be used

    ✤ Problem: libraries that involve building external sources (PEP 517) are almost always a problem.
    • E.g.
    • orjson … high-speed JSON serializer written in Rust 😭
    • dd-trace-py … for measuring with Datadog APM 😭
    • fastapi-profiler … the profiler that has been working wonders 😭
    • black (caused by typed-ast) … Linter/Formatter, used through pysen 😭
    ✤ Solution
    • Use separate processing systems for development and deployment environments
    3-4. Improving overall performance through changes in the processing system
  45. Problems in PyPy (2): Some libraries cannot be used

    For development: CPython3.7. We also run unit tests (UT) on CI
    For production: PyPy3.7
    To prevent bugs, such as the program working in CPython but not in PyPy, we run e2e tests using production containers during CD
    3-4. Improving overall performance through changes in the processing system
  46. Problems in PyPy (3): OOM death due to omission of GC Option specification

    ✤ Problem: Memory increases to the maximum and the process is killed by OOMKiller.
    • GC for PyPy is incminimark
    • GC options must be specified
    ✤ Solution
    • Output GC logs with PYPY_GC_DEBUG=2
    • Restrict the heap limit with PYPY_GC_MAX
    3-4. Improving overall performance through changes in the processing system
  47. Problems in PyPy (3): OOM death due to omission of

    GC Option specification Ensure MaxWorker x PYPY_GC_MAX   At least specify PYPY_GC_MAX 3-4. Improving overall performance through changes in the processing system
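
A sketch of how a per-worker heap limit could be chosen. The comparison target is cut off in the transcript, so this assumes the intent is to keep workers x PYPY_GC_MAX below the container's memory limit; the helper name, the 20% headroom, and the example sizes are all hypothetical. PyPy's GC environment variables are understood to accept size suffixes such as MB, but check the PyPy documentation for the version in use.

```python
def pypy_gc_max_per_worker(container_limit_mb, workers, headroom_ratio=0.2):
    """Per-worker heap cap (MB) so that all workers together stay under the limit."""
    # Assumption: leave some headroom for non-heap memory in each process.
    usable_mb = container_limit_mb * (1 - headroom_ratio)
    return int(usable_mb // workers)


# Hypothetical deployment: a 4096 MB container running 4 worker processes.
limit_mb = pypy_gc_max_per_worker(container_limit_mb=4096, workers=4)

# Environment to pass to the worker processes (assumed value format).
worker_env = {"PYPY_GC_MAX": f"{limit_mb}MB"}
print(worker_env)
```
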
  48. Problems in PyPy (4): Memory leak when combined with FastAPI?

    ✤ Problem: Memory increases to the maximum and the process is killed by OOMKiller.
    • Even on echo servers, memory increases monotonically under load
    • I had not set KeepAlive on nginx...
    ✤ Solution
    • Resolved by connecting with HTTP 1.1 (KeepAlive by default)
    • Set timeout-keep-alive on uvicorn as well
    • I have not dug deep into this, but it seems to be a connection problem, so it may be due to uvicorn rather than FastAPI.
    3-4. Improving overall performance through changes in the processing system
  49. 3. Problems Faced

    Performance improvement per instance after handling all
    Before: [Throughput] 50rps, [95%tile Latency] 300ms
    After: [Throughput] 250rps(500%⬆), [95%tile Latency] 30ms(1000%⬆)
  50. Summary ✤ Load Testing and Profiling • Determine targets →

    Measure → Resolve bottlenecks • Useful tools: locust, datadog, fastapi-profiler, py-spy ✤ Problems and solutions • I/O bound: asyncio, cache • CPU bound: cache, Cython, numpy • Change of processing system: PyPy