High Performance FastAPI EN

High Performance FastAPI Ikuo Suyama

陶山育男 / Ikuo Suyama (@martin_lover_se), SmartNews, Inc.
✤ Ads Backend Engineer
✤ Internet ads specialist
✤ "Nothing beats ads, not even food"
✤ Usually works on the JVM; about one year of experience with Python
✤ Please go easy on me… 🙇

Session Takeaways
Detailed methods for performance tuning of a Python web application
✤ Profiling methods and identification of bottlenecks
✤ Specific challenges faced in our environment and countermeasures

Agenda
1. Introduction
2. Load Testing and Profiling
3. Problems Faced and Countermeasures
4. Summary

About SmartNews
No. 1 news app user base*1: an everyday habit of consumers makes the largest user base in Japan
• Number of users: 20 million people per month (consolidated for Japan and US, as of August 2019)
• Usage per person: about 16.7 minutes per day*2
*1. Source: Nielsen Mobile NetView as of January 2021 (calculation of the SmartNews app's user base based on the number of installs of the SmartNews app)
*2. In-house figures, average for January 2021

Python in SmartNews - Coupon Channel
Adopting FastAPI for web servers
[Architecture diagram: news-service, coupon-service, api-gateway, coupon-admin, Redis, RDB]

FastAPI
"FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints." - FastAPI
• Flask-like decorator-based routing API
• Compatible with OpenAPI
• Simple DI function
• Easy to learn; development can begin quickly
Makes prompt service launch possible!

FastAPI - Seeing is believing

Full Rollout…
[Architecture diagram: news-service, coupon-service, api-gateway, coupon-admin, Redis, RDB]
SmartNews power!!
🔥 Performance tuning

Performance Tuning Mindset
1. Determine performance targets
2. Measure and identify bottlenecks
3. Resolve bottlenecks
Met performance targets? YES → END 🎉, NO → repeat

"Performance" of Web Applications You can go on and on
if there is no target! (Because it's so much fun!)
• Throughput: How many requests can be handled per unit of time?
• Latency: How much time does it take to handle one request?
2-1. Determine performance targets

2-1. Determine performance targets
Little's theorem: L = λW
• L … the average number of items in a queuing system
• λ … the average number of items arriving at the system per unit of time
• W … the average waiting time an item spends in a queuing system
In terms of web application performance, this means: L is the number of requests being processed concurrently, λ is the throughput, and W is the latency.

2-1. Determine performance targets
Method of determining performance targets: Example
Example: handling a 1,000 rps system with 50 cores
• Suppose we use ten 5-core instances (or Pods).
• Each 5-core instance then needs to handle 1000/10 = 100 rps, so set the performance target per instance as: Throughput 100 rps (or more), Latency 50 ms (or less).
Note: This is just a rough estimate. It is not always possible to use 100% of the CPU, I/O and concurrency also need to be considered, and the number of CPU cores usable in a K8s pod does not correspond exactly to the limit, so the actual calculation is quite complex.
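
The arithmetic behind the example can be sketched as follows. This is a minimal sketch, assuming that each core handles one in-flight request at a time (so L equals the number of cores per instance); the function name and that assumption are illustrative and not taken from the talk.

```python
def per_instance_targets(total_rps: float, instances: int, cores_per_instance: int):
    """Return (throughput target in rps, latency target in ms) for one instance."""
    # Throughput (lambda) that each instance has to sustain.
    rps_per_instance = total_rps / instances
    # Assumption: L = number of cores, i.e. one in-flight request per core.
    concurrency = cores_per_instance
    # Little's law: L = lambda * W  =>  W = L / lambda (converted to milliseconds).
    latency_ms = concurrency * 1000 / rps_per_instance
    return rps_per_instance, latency_ms


throughput, latency = per_instance_targets(
    total_rps=1000, instances=10, cores_per_instance=5
)
print(f"{throughput:.0f} rps, {latency:.0f} ms")  # 100 rps, 50 ms
```

Under that assumption, the 50 ms latency target is simply what 5 concurrent requests at 100 rps imply.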

2-2. Measure and identify bottlenecks
Do not guess, measure. Bottlenecks cannot be identified without measuring, and there is no point in tuning what is not a bottleneck.
"Rule 1: You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
Rule 2: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest."
- Rob Pike: Notes on Programming in C
- [Wikipedia] Pike: Notes on Programming in C (Japanese)

2-2. Measure and identify bottlenecks
Speed up the feedback loop
1. Determine the performance targets
2. Measure and identify bottlenecks
• Prepare the environment for load testing
• Add load and check the load situation of the whole system
• Is the application the bottleneck? If NO, correct until the bottleneck becomes part of the application
• Check the application's load situation
• Recreate the same conditions locally as much as possible, and measure while adding load
3. Resolve bottlenecks
• Correct the part that is most likely to be the bottleneck
• Is the bottleneck resolved? Has performance improved?
• Deploy in the load testing environment and measure
Met performance targets? YES → END 🎉, NO → repeat
Tools for measurement: APM, fastapi_profiler, py-spy

2-2. Measure and identify bottlenecks
Add load: LOCUST
"Locust is an easy to use, scriptable and scalable performance testing tool." - LOCUST
• A load testing tool written in Python
• Scenarios are written with a Python API
• Simple and adequate settings
• Can be run in clusters for large-scale loads

2-2. Measure and identify bottlenecks
Add load: LOCUST
[Screenshot annotations: specify the number of simultaneous connections; run in clusters; rps and median/95th-percentile latency]
Keep adding load by increasing the number of simultaneous connections until latency worsens, and check the maximum throughput and latency.

2-2. Measure and identify bottlenecks
Check the load situation of the whole system: Datadog
• Check the resource situation of the whole system while load is added
• Q: Has the application become the bottleneck?
• There is no point in tuning the application until the application becomes the bottleneck.
• Middleware, such as databases, usually becomes the bottleneck first.
• Keep tuning until the application's throughput stops increasing even though other resources, such as upstream services and connected middleware, still have capacity available.
• Today, I will only be talking about tuning the application. 🙇
So, the rest of the presentation is based on the assumption that the application is the bottleneck.

2-2. Measure and identify bottlenecks
• Check the number of requests for each service and the usage of CPU, memory, etc.
• Check the CPU usage of middleware and the changes in the number of connections
Although detailed status can also be checked by attaching to the containers, this is very convenient for grasping the big picture.

2-2. Measure and identify bottlenecks
Application measurement (1): Datadog APM
"Datadog APM and Continuous Profiler provide detailed visibility into the application with standard performance dashboards for web services, queues, and databases to monitor requests, errors, and latency." - Datadog APM
• Easy to set up: a sidecar and a few lines of launch code
• Gets data from the real operating environment
• Helps find bottlenecks, as most of the necessary information is available
• But it is expensive 💸, so workarounds are needed, such as applying it only to specific instances...

2-2. Measure and identify bottlenecks
Application measurement (1): Datadog APM
[Screenshot annotations: latency histogram; CPU time by script; CPU time by function]
This is taking up the most time and is likely to be a bottleneck.

2-2. Measure and identify bottlenecks
Application measurement (2): fastapi_profiler
"A FastAPI middleware of pyinstrument to check your service code performance." - fastapi_profiler
• Integrates pyinstrument as a FastAPI middleware
• Useful for measuring CPU time while modifying code locally
• Cannot be used in production or under load because the impact on performance is high

2-2. Measure and identify bottlenecks
Application measurement (2): fastapi_profiler
• Setup only requires add_middleware
• The sorting method is slow = possible bottleneck

2-2. Measure and identify bottlenecks
Application measurement (3): py-spy
"It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way." - py-spy
• A profiler written in Rust, with extremely low overhead
• Useful for quickly measuring CPU in a local or load testing environment
• Also useful when the GIL or multi-threaded processing could be the bottleneck
• fastapi_profiler cannot measure threads that are not under FastAPI

2-2. Measure and identify bottlenecks
Application measurement (3): py-spy
This method takes up most of the endpoint processing time.

Configuration and load characteristics for deploying FastAPI
"Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class." - Server Workers - Gunicorn with Uvicorn
[Diagram: requests and responses pass through the stack via function calls and callbacks]
• Process Manager (Gunicorn): launches the processes, forks the workers and hands the sockets over to them
• ASGI Server (Uvicorn, running as the worker): request/response handling, asynchronous processing
• ASGI Framework (Starlette)
• Web Framework (FastAPI): extends Starlette
• Application: business logic

Configuration and load characteristics for deploying FastAPI
• Process Manager (Gunicorn): there is no impact on performance because it only deals with process management
• ASGI Server (Uvicorn): the fastest Python ASGI server
• Web Framework (FastAPI): it is not faster than Starlette, but...
• Items below Starlette did not pose a bottleneck this time
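
A Gunicorn configuration file is plain Python, so the process-manager/worker split described above can be sketched as below. This is only an illustration: the file name, port and worker count are assumptions, not the values used in the talk.

```python
# gunicorn.conf.py (hypothetical file name and values)
import multiprocessing

# Address the workers listen on; the port is an assumption.
bind = "0.0.0.0:8000"

# Assumption: one worker process per available core.
workers = multiprocessing.cpu_count()

# Gunicorn only manages processes; each worker runs Uvicorn's
# Gunicorn-compatible worker class, which serves the ASGI application.
worker_class = "uvicorn.workers.UvicornWorker"
```

It would typically be started with something like `gunicorn -c gunicorn.conf.py main:app`, where `main:app` is a placeholder for the module and FastAPI instance.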

3. Problems Faced
1. Slow I/O bound processing
2. Slow CPU bound processing
3. Slow logging: Slow MultiThread
4. Improving overall performance through changes in the processing system

3-1. Slow I/O bound processing
✤ Symptoms
• Throughput does not increase even though the CPU is not used up.
• Check CPU usage with top, vmstat, etc.
✤ Causes
• Processing that requires network I/O
• E.g.: access to a DB or other middleware
• File writing
• etc.

3-1. Slow I/O bound processing
Countermeasure (1): Use asyncio
"asyncio is a library to write concurrent code using the async/await syntax." - asyncio
• Use async/await wherever the network is involved
• Many compatible libraries
• E.g.: aioredis, httpx, …

3-1. Slow I/O bound processing
Countermeasure (1): Use asyncio
• On FastAPI, simply declare the routing method as async
• Use await inside the async method
I set this as the default for anything involving the network, so there were hardly any problems with I/O.
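
Why awaiting network calls helps can be illustrated with the standard library alone. The sketch below is not FastAPI code: `fetch_coupon` is a hypothetical stand-in for a Redis or HTTP call, and the 50 ms delay is an assumed value.

```python
import asyncio
import time


async def fetch_coupon(coupon_id: int) -> dict:
    # Stand-in for a network call (e.g. Redis or HTTP); assumed to take ~50 ms.
    await asyncio.sleep(0.05)
    return {"id": coupon_id}


async def handle_requests(n: int) -> list:
    # While one coroutine is waiting on I/O, the event loop runs the others,
    # so the waits overlap instead of adding up.
    return await asyncio.gather(*(fetch_coupon(i) for i in range(n)))


def run_demo(n: int = 10):
    start = time.perf_counter()
    results = asyncio.run(handle_requests(n))
    return results, time.perf_counter() - start
```

Ten simulated calls finish in roughly the time of one, whereas running them one after another would take about ten times as long. The same effect is what an `async def` route with awaited I/O gives a single worker process.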

3-1. Slow I/O bound processing
Countermeasure (2): Cache
• Anything involving the network is still slow
• A network round trip takes thousands of times longer than a main memory reference (about 5,000 times by the numbers below)
• Cache retrieved results in application memory if some staleness in the results can be tolerated
Latency Comparison Numbers (~2012)
----------------------------------
:
Main memory reference 100 ns
:
Round trip within same datacenter 500,000 ns (500 us)
- Latency Numbers Every Programmer Should Know

3-1. Slow I/O bound processing
Countermeasure (2): Cache
• A simple LRUCache using OrderedDict
• Redis is accessed only when there is no cache entry
• The cache is applied per method
• The lru_cache decorator is also useful
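
A minimal sketch of such an OrderedDict-based LRU cache is shown below. It is an illustration rather than the code from the slide: the class shape, `max_size` and the `load_from_redis` stand-in are all assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny in-process LRU cache; entries are loaded on a miss."""

    def __init__(self, max_size: int = 1024):
        self._data = OrderedDict()
        self._max_size = max_size

    def get(self, key, loader):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = loader(key)  # cache miss: go to the backing store
        self._data[key] = value
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


calls = []


def load_from_redis(key):
    # Hypothetical stand-in for a Redis lookup; records each real access.
    calls.append(key)
    return f"value-for-{key}"


cache = LRUCache(max_size=2)
cache.get("a", load_from_redis)
cache.get("a", load_from_redis)  # served from memory, no second lookup
```

For pure functions with hashable arguments, `functools.lru_cache` gives the same behaviour without a custom class.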

3-2. Slow CPU bound processing
Now, let's focus on application tuning!
✤ Symptoms
• High user time and high load average (LA) shown in top, vmstat, etc.
• Throttling in a k8s environment
✤ Main causes
• Simply heavy calculations, such as floating point calculations
• Inefficient algorithms
• etc.

3-2. Slow CPU bound processing
Countermeasure (1): Use prior calculation/cache
• Either calculate offline in advance, or bypass heavy calculations by caching calculation results
• E.g.: URL encoding processing (urllib/parse.py) is slow
• urllib/parse.py takes up a lot of time, so cache the URL calculation results
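
Caching the result of URL encoding can be sketched with the standard library as follows. The function name and cache size are assumptions chosen for illustration; the talk does not show the exact code.

```python
from functools import lru_cache
from urllib.parse import quote


@lru_cache(maxsize=10_000)  # assumed size; tune to the number of distinct inputs
def encode_url_component(value: str) -> str:
    # The percent-encoding in urllib/parse.py runs only on a cache miss.
    return quote(value, safe="")


encoded = encode_url_component("coupon name/50% off")
encoded_again = encode_url_component("coupon name/50% off")  # cache hit
```

This only pays off when the same inputs recur often, which is the case when a limited set of URLs is encoded on every request.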

3-2. Slow CPU bound processing
Countermeasure (2): Cython
"Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language" - Cython
• A measure for items that must be calculated online for each request and have a high computational load
• E.g.: calculating the geohash for the requested location information
• Pre-compile to a low-level language such as C
• Can be applied partially, such as per function

3-2. Slow CPU bound processing
Countermeasure (2): Cython
• Formula to calculate the geohash (map tile) from the specified location information
• It is compiled by writing it into a .pyx file
• The caller is the same as usual Python
• Prepare setup.py like this
• Compile during docker build

3-2. Slow CPU bound processing
Countermeasure (3): numpy
• Loading from the CPU cache is much faster than loading from RAM
• Optimize the use of the CPU cache by using numpy vector calculation for loop processing
• E.g.: calculating the distance to each coupon from the requested location information
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
L2 cache reference 7 ns (14x L1 cache)
:
Main memory reference 100 ns (20x L2 cache, 200x L1 cache)
- Latency Numbers Every Programmer Should Know

3-2. Slow CPU bound processing
Countermeasure (3): numpy
• We want to calculate online the distance between the requested location and each coupon
• Turn the coupon locations into vectors, and use vector calculation for future requests
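
The vectorized distance calculation might look like the sketch below. The haversine formula, the Earth radius constant and the sample coordinates are assumptions for illustration; the talk does not say which distance formula was used.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def distances_km(lat, lon, coupon_lats, coupon_lons):
    """Great-circle distance from one point to every coupon, without a Python loop."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(coupon_lats), np.radians(coupon_lons)
    # Haversine formula applied to whole arrays at once.
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coupon coordinates are converted to arrays once and reused for every request.
coupon_lats = np.array([35.0, 36.0])
coupon_lons = np.array([139.0, 139.0])
d = distances_km(35.0, 139.0, coupon_lats, coupon_lons)
# The second coupon is one degree of latitude away, roughly 111 km.
```

Keeping the coupon coordinates in contiguous arrays is what lets numpy run the arithmetic in tight, cache-friendly loops.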

3-3. Slow logging: Slow MultiThread
✤ Symptoms
• CPU is used by threads other than FastAPI
• Check with Datadog
• GIL values are low in py-spy top
✤ Causes
• A library called loguru was writing logs in multiple threads, and that was taking a long time
• Python can handle threads, but it is not efficient because of the GIL

3-3. Slow logging: Slow MultiThread
GIL / Global Interpreter Lock
"In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once." - Global Interpreter Lock
• Even if you set up threads, only one thread runs at a time in the process
• In Python, parallel processing is handled by multiprocessing. Combining that with concurrent processing via asyncio can solve the problem.

3-3. Slow logging: Slow MultiThread
• Serialization and emit of log information are taking a long time
• It seems to be retrieving from the Queue and deserializing
• If _enqueue is true (default), a SimpleQueue and a thread are created for log output
• multiprocessing.SimpleQueue is actually a pipe for communication between processes
• Data transfer to and from the writer thread seems to be slow

3-3. Slow logging: Slow MultiThread
Countermeasure: Suppress logging and change libraries
• Switched to fastlogging
• I would like to handle relatively large logs, but this is still unresolved.

3-4. Improving overall performance through changes in the processing system
• There are various processing systems (Python implementations) other than CPython, which we usually use.
• Some compile to a lower-level language, some have a JIT compiler, etc.
• I tried two and selected PyPy this time.
• PyPy
• cinder
• If the CPU is the bottleneck, overall performance improvement can be expected.

3-4. Improving overall performance through changes in the processing system
cinder
"Cinder is Instagram's internal performance-oriented production version of CPython 3.8. It contains a number of performance optimizations, including bytecode inline caching, eager evaluation of coroutines, …" - cinder
• Performance-improved CPython from Facebook
• We saw a performance improvement of about 10% in our environment.
• It is highly compatible with CPython and most of its libraries.
• However, you need to build it yourself, the GitHub repository was not very active, and there was no documentation, so it was difficult to manage and I gave up on it.

3-4. Improving overall performance through changes in the processing system
PyPy
"A fast, compliant alternative implementation of Python. On average, PyPy is 4.2 times faster(!) than CPython" - PyPy
• JIT compiler / incminimark GC
• It is also quite compatible with CPython.
• There have been no compatibility-related problems so far.
• We confirmed a performance improvement of nearly 40%.
• The performance benefits were so great that we now use it in production.

3-4. Improving overall performance through changes in the processing system
PyPy - This is not a silver bullet
✤ Challenges faced in replacement and operation
• The latest version is 3.7.
• Some libraries are not available.
• OOM death due to omission of GC Option specification
• Memory leak when combined with FastAPI?

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (1): The latest version is 3.7
✤ Problem: 3.7 is the latest version as of 2021
• 3.8 is in beta
✤ Solution
• Since I was developing on version 3.8, I downgraded some parts
• Walrus operator :=
• Positional-only parameters: def huga(hoge, /, …)
• I have not faced any real problems.
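
The two downgrades are mechanical. The sketch below shows Python 3.7-compatible equivalents, with the 3.8 form in a comment; `parse_discount` and the body of `huga` are made-up examples, not code from the talk.

```python
import re


# Python 3.8:  if (m := re.match(r"(\d+)%", text)): return int(m.group(1))
def parse_discount(text: str):
    m = re.match(r"(\d+)%", text)  # 3.7: assign first, then test
    if m:
        return int(m.group(1))
    return None


# Python 3.8:  def huga(hoge, /, fuga): ...
def huga(hoge, fuga):
    # 3.7 has no "/" marker, so "positional-only" is just a convention here.
    return hoge + fuga
```

Both rewrites behave the same on 3.7 and 3.8, except that the 3.7 version of `huga` no longer rejects `hoge` passed as a keyword argument.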

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (2): Some libraries cannot be used
✤ Problem: PEP 517 libraries that involve building from external sources are almost always a problem.
• E.g.
• orjson … high-speed JSON serializer written in Rust 😭
• dd-trace-py … for measuring with Datadog APM 😭
• fastapi-profiler … the profiler that has been working wonders 😭
• black (caused by typed-ast) … linter/formatter, used through pysen 😭
✤ Solution
• Use separate processing systems for the development and deployment environments

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (2): Some libraries cannot be used
• For development: CPython 3.7. We also run unit tests (UT) on CI.
• For production: PyPy 3.7
• To prevent bugs such as the program working in CPython but not in PyPy, we run e2e tests using the production containers during CD.

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (3): OOM death due to omission of GC Option specification
✤ Problem: Memory increases to the maximum and the process is killed by the OOM killer.
• The GC for PyPy is incminimark
• GC options must be specified
✤ Solution
• Output GC information with PYPY_GC_DEBUG=2
• Restrict the heap limit with PYPY_GC_MAX

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (3): OOM death due to omission of GC Option specification
• At least specify PYPY_GC_MAX
• Ensure that MaxWorker x PYPY_GC_MAX fits within the memory available to the container
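
One way to pick a value is to divide the container's memory limit among the workers and leave some headroom. The sketch below is an assumption-laden illustration: the helper name, the 80% headroom factor and the example numbers are not from the talk, and the "MB" suffix should be checked against the PyPy GC documentation for the version in use.

```python
def pypy_gc_max_per_worker(memory_limit_mb: int, workers: int, headroom: float = 0.8) -> str:
    """Suggest a PYPY_GC_MAX value so that workers x PYPY_GC_MAX stays under the limit."""
    # Keep part of the limit free for non-heap memory (assumed 20% here).
    per_worker_mb = int(memory_limit_mb * headroom / workers)
    return f"{per_worker_mb}MB"


# Example: a 4096 MB container running 4 worker processes.
gc_max = pypy_gc_max_per_worker(memory_limit_mb=4096, workers=4)
print(f"PYPY_GC_MAX={gc_max}")  # PYPY_GC_MAX=819MB
```

The resulting value would be exported as an environment variable for each worker process before PyPy starts.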

3-4. Improving overall performance through changes in the processing system
Problems in PyPy (4): Memory leak when combined with FastAPI?
✤ Problem: Memory increases to the maximum and the process is killed by the OOM killer.
• Even on echo servers, memory increases monotonically under load
• I had not set KeepAlive on nginx...
✤ Solution
• Resolved by connecting with HTTP/1.1 (keep-alive by default)
• Also set timeout-keep-alive on uvicorn
• I have not dug deep into this, but it seems to be a connection problem, so it may be due more to uvicorn than to FastAPI.

3. Problems Faced
Performance improvement per instance after applying all countermeasures
• Before: [Throughput] 50 rps, [95%tile Latency] 300 ms
• After: [Throughput] 250 rps (500% ⬆), [95%tile Latency] 30 ms (1000% ⬆)

Summary
✤ Load Testing and Profiling
• Determine targets → Measure → Resolve bottlenecks
• Useful tools: locust, datadog, fastapi-profiler, py-spy
✤ Problems and solutions
• I/O bound: asyncio, cache
• CPU bound: cache, Cython, numpy
• Change of processing system: PyPy

Thank You for Your Kind Attention! twitter: @martin_lover_se
