Slide 1

Slide 1 text

High Performance FastAPI Ikuo Suyama

Slide 2

Slide 2 text

陶山 育男/Ikuo Suyama @martin_lover_se SmartNews, Inc. ✤ Ads Backend Engineer ✤ Internet ads specialist ✤ "Nothing beats ads, not even food " ✤ Usually use JVM, about one year of experience using Python ✤ Please go easy on me… 🙇

Slide 3

Slide 3 text

Session Takeaways — a detailed method of performance tuning for a Python web application ✤ Profiling methods and identification of bottlenecks ✤ Specific challenges faced in our environment and countermeasures

Slide 4

Slide 4 text

1. Introduction 2. Load Testing and Profiling 3. Problems Faced and Countermeasures 4. Summary Agenda

Slide 5

Slide 5 text

1. Introduction 2. Load Testing and Profiling 3. Problems Faced and Countermeasures 4. Summary Agenda

Slide 6

Slide 6 text

About SmartNews No. 1 News App User Base*1 - An everyday habit of consumers makes the largest user base in Japan Number of users: 20 million people per month (Consolidated for Japan and US, as of August 2019) Used for about 16.7 minutes*2 per day per person *1. Source: Nielsen Mobile NetView as of January 2021 (Calculation of SmartNews App's user base based on number of installs of SmartNews App) *2. In-house figures, average for January 2021

Slide 7

Slide 7 text

Python in SmartNews - Coupon Channel news-service coupon-service api-gateway coupon-admin Redis RDB Adopting FastAPI for web servers

Slide 8

Slide 8 text

• Flask-like decorator-based routing API • OpenAPI-compatible • Simple dependency injection • Easy to learn, so development can begin quickly "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints." - FastAPI Makes prompt service launch possible!

Slide 9

Slide 9 text

FastAPI - Seeing is believing

Slide 10

Slide 10 text

Full Rollout… news-service coupon-service api-gateway coupon-admin Redis RDB SmartNews power!! 🔥 Performance tuning

Slide 11

Slide 11 text

1. Introduction 2. Load Testing and Profiling 3. Problems Faced and Countermeasures 4. Summary Agenda

Slide 12

Slide 12 text

Performance Tuning Mindset 1. Determine performance targets 2. Identify measurements and bottlenecks 3. Resolve bottlenecks Met performance targets? END 🎉 YES NO

Slide 13

Slide 13 text

"Performance" of Web Applications You can go on and on if there is no target! Throughput How many requests can be handled in a short time?   ^(Because it's so much fun!) Latency How much time does it take to deal with one request? 2-1. Determine performance targets

Slide 14

Slide 14 text

Little's law: L = λW L … the average number of items in a queuing system λ … the average number of items arriving at the system per unit of time W … the average waiting time an item spends in a queuing system In terms of web application performance, this means... 2-1. Determine performance targets

Slide 15

Slide 15 text

Method of determining performance targets: Example Handle a 1,000 rps system with 50 cores, supposing the use of ten 5-core instances (or Pods). Since each 5-core instance should handle 1000/10 = 100 rps, set the performance target per instance as: Throughput 100 rps (or more), Latency 50 ms (or less) Note: It may not always be possible to use 100% of the CPU; I/O and concurrency also need to be considered, and the number of CPU cores a K8s pod can use does not correspond directly with its limit, so the calculation is actually quite complex; this is just a rough estimate. 2-1. Determine performance targets
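The arithmetic above can be sketched directly; the figures mirror the hypothetical workload in the example:

```python
# Rough capacity estimate based on Little's law: L = lambda * W.
target_rps_total = 1000      # lambda for the whole system
instances = 10               # ten 5-core instances -> 50 cores total
latency_target_s = 0.05      # W: 50 ms per request

rps_per_instance = target_rps_total / instances               # -> 100.0 rps
concurrent_per_instance = rps_per_instance * latency_target_s # L = lambda * W -> 5.0
```

So each 5-core instance should expect about 5 requests in flight at any moment, which is consistent with one request per core plus some headroom.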

Slide 16

Slide 16 text

Rule 1: You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is. Rule 2: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest. - Rob Pike: Notes on Programming in C Bottlenecks cannot be identified without measuring, and there is no point in tuning what is not a bottleneck. Do not guess, measure. 2-2. Identify measurements and bottlenecks

Slide 17

Slide 17 text

1. Determine the performance targets 2. Identify measurements and bottlenecks 3. Resolve bottlenecks → Met performance targets? YES → END 🎉 NO → repeat. Workflow: Prepare the environment for load testing → Add load and check the load situation of the whole system → Is the application the bottleneck? (If not, correct until the bottleneck becomes part of the application) → Check the application's load situation → Recreate the same conditions locally as much as possible, and measure while adding load → Correct the part that is most likely to be the bottleneck → Is the bottleneck resolved? Has performance improved? → Deploy in the load testing environment and measure. Speed up the feedback loop. Tools for measurement: APM, fastapi_profiler, py-spy. 2-2. Identify measurements and bottlenecks

Slide 18

Slide 18 text

• A benchmark load testing tool made with Python • Writes scenarios with Python API • Simple and adequate settings • Can be run in clusters for large-scale loads Locust is an easy to use, scriptable and scalable performance testing tool. - LOCUST 2-2. Identify measurements and bottlenecks Add load: LOCUST

Slide 19

Slide 19 text

Add load: LOCUST Specify the number of simultaneous connections and check rps and median/95%tile latency. Add load while increasing the number of simultaneous connections until latency worsens, and check the maximum throughput. 2-2. Identify measurements and bottlenecks

Slide 20

Slide 20 text

Check the load situation of the whole system: Datadog • Check the resource situation of the whole system while load is added • Q: Has the application become the bottleneck? • There is no point in tuning the application until the application becomes the bottleneck. • Middleware, such as databases, usually becomes the bottleneck first. • Keep tuning elsewhere until the application throughput stops increasing even though resources, such as upstream services and connected middleware, still have capacity available. • Today, I will only be talking about tuning the application 🙇 So, the rest of the presentation is based on the assumption that the application is the bottleneck. 2-2. Identify measurements and bottlenecks

Slide 21

Slide 21 text

Check the number of requests for each service and the usage of CPU/memory, etc. Check the CPU usage of middleware and the changes in the number of connections. Although detailed status can also be checked by attaching to the containers, this is very convenient for grasping the big picture. 2-2. Identify measurements and bottlenecks

Slide 22

Slide 22 text

• Easy to set up, sidecar and a few lines of launch code • Gets data from the real operating environment • Helps find bottlenecks as most of the necessary information is available • But, it is expensive 💸, and there is a need to find workarounds, such as applying only to specific instances... Application measurement (1): Datadog APM Datadog APM and Continuous Profiler provide detailed visibility into the application with standard performance dashboards for web services, queues, and databases to monitor requests, errors, and latency. — Datadog APM 2-2. Identify measurements and bottlenecks

Slide 23

Slide 23 text

Application measurement (1): Datadog APM Latency histogram, CPU time by script, CPU time by function. This one takes up the most time and is likely to be a bottleneck. 2-2. Identify measurements and bottlenecks

Slide 24

Slide 24 text

Application measurement (2): fastapi_profiler A FastAPI middleware of pyinstrument to check your service code performance. — fastapi_profiler • Integrate pyinstrument as a FastAPI middleware • Useful for measuring CPU time while modifying code locally • Cannot be used in the actual application or under load because the impact on performance is high 2-2. Identify measurements and bottlenecks

Slide 25

Slide 25 text

Application measurement (2): fastapi_profiler add_middleware resolves the problem The sorting method is slow = possible bottleneck 2-2. Identify measurements and bottlenecks

Slide 26

Slide 26 text

• Rust-based profiler, extremely low overhead • Useful for quickly measuring CPU in a local/load environment • Also useful when the GIL or multi-threaded processing could be the bottleneck, since fastapi_profiler cannot measure threads that are outside its reach Application measurement (3): py-spy "It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way." — py-spy 2-2. Identify measurements and bottlenecks

Slide 27

Slide 27 text

This method takes up most of the endpoint processing time Application measurement (3): py-spy 2-2. Identify measurements and bottlenecks

Slide 28

Slide 28 text

1. Introduction 2. Load Testing and Profiling 3. Problems Faced and Countermeasures 4. Summary Agenda

Slide 29

Slide 29 text

Configuration and load characteristics for deploying FastAPI "Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class." — Server Workers - Gunicorn with Uvicorn Gunicorn launches the process, forks workers, and hands sockets to them. The worker handles request/response and asynchronous processing (request → func call, callback → response); the application (business logic) extends the worker.

Slide 30

Slide 30 text

Configuration and load characteristics for deploying FastAPI "Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class." — Server Workers - Gunicorn with Uvicorn Gunicorn has no impact on performance because it only deals with process management. Uvicorn is the fastest Python ASGI server. FastAPI is not faster than Starlette, but the layers below Starlette did not pose a bottleneck this time.

Slide 31

Slide 31 text

1. Slow I/O bound processing 2. Slow CPU bound processing 3. Slow logging: Slow MultiThread 4. Improving overall performance through changes in the processing system 3. Problems Faced

Slide 32

Slide 32 text

3-1. Slow I/O bound processing ✤ Symptoms • Throughput does not increase even though the CPU is not used up. • Check CPU usage (us) with top, vmstat, etc. ✤ Causes • Processing that requires network I/O • E.g.: access to the DB or other middleware • File writing • etc.

Slide 33

Slide 33 text

• Use async/await wherever the network is involved • Many compatible libraries • E.g.: aioredis/httpx … 3-1. Slow I/O bound processing Countermeasure (1): Use asyncio "asyncio is a library to write concurrent code using the async/await syntax." — asyncio

Slide 34

Slide 34 text

3-1. Slow I/O bound processing Countermeasure (1): Use asyncio On FastAPI, simply declare the routing method as async and use await inside the async method. I set this as the default for anything involving the network, so there were hardly any problems in I/O.
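The pattern above can be sketched with plain asyncio (the `fetch` coroutine below is a stand-in for a real network call such as Redis or HTTP; names are illustrative):

```python
import asyncio
import time

async def fetch(key):
    # Stand-in for a network call; asyncio.sleep yields the event
    # loop exactly like an awaited socket read would.
    await asyncio.sleep(0.1)
    return f"value:{key}"

async def handler():
    # Awaiting the three calls concurrently takes ~0.1 s
    # instead of ~0.3 s sequentially.
    return await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))

start = time.perf_counter()
results = asyncio.run(handler())
elapsed = time.perf_counter() - start
```

Inside a FastAPI `async def` route the same `await asyncio.gather(...)` works unchanged, since the route runs on the event loop.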

Slide 35

Slide 35 text

• Still slow wherever the network is involved • A round trip within the same datacenter takes on the order of 5,000 times longer than a main memory reference (per the numbers below) • Cache retrieved results in application memory if stale results can be tolerated 3-1. Slow I/O bound processing Countermeasure (2): Cache Latency Comparison Numbers (~2012) ---------------------------------- Main memory reference 100 ns Round trip within same datacenter 500,000 ns 500 us — Latency Numbers Every Programmer Should Know

Slide 36

Slide 36 text

3-1. Slow I/O bound processing Countermeasure (2): Cache Simple LRUCache using OrderedDict: access Redis only when there is no cache entry, and use a cache per method. The lru_cache annotation is also useful.
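A minimal sketch of the approach (this is generic example code, not the talk's exact implementation; `fetch_from_redis` is a hypothetical coroutine):

```python
from collections import OrderedDict

class LRUCache:
    """Simple LRU cache on top of OrderedDict."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=1024)

async def get_coupon(coupon_id, fetch_from_redis):
    cached = cache.get(coupon_id)
    if cached is not None:
        return cached
    value = await fetch_from_redis(coupon_id)  # hit Redis only on a miss
    cache.put(coupon_id, value)
    return value
```

For pure functions without an eviction policy requirement, `functools.lru_cache` gives the same effect with one decorator line.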

Slide 37

Slide 37 text

✤ Symptoms • High user time in top, vmstat, etc.; high load average • CPU throttling in a k8s environment ✤ Main causes • Simple heavy calculations such as floating point calculations • Inefficient algorithms • etc. 3-2. Slow CPU bound processing Now, let's focus on application tuning!

Slide 38

Slide 38 text

• E.g.: URL encoding processing (urllib/parse.py) is slow Countermeasure (1): Use prior calculation/cache Either calculate offline in advance, or bypass heavy calculations by caching calculation results. urllib/parse.py takes up a lot of time, so cache URL calculation results. 3-2. Slow CPU bound processing

Slide 39

Slide 39 text

• Measures for items that must be calculated online for each request and have a high computational load • E.g.: calculating geohash for requested location information • Pre-compile to a low-level language such as C • Can be partially applied, such as per function Countermeasure (2): Cython 3-2. Slow CPU bound processing Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language — Cython

Slide 40

Slide 40 text

Countermeasure (2): Cython 3-2. Slow CPU bound processing Write the formula that calculates the geohash (maptile) from the specified location information into a .pyx file; the caller stays the same as usual Python. Prepare setup.py accordingly, and compile during docker build.
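As a sketch of this approach (module, function, and the tile formula here are hypothetical stand-ins; the talk's actual geohash code is not shown):

```cython
# maptile.pyx (hypothetical): typed arguments let Cython emit plain C arithmetic
from libc.math cimport floor

def tile_x(double lon, int zoom):
    return <int>floor((lon + 180.0) / 360.0 * (1 << zoom))
```

```python
# setup.py: cythonize the hot module; run during `docker build`
# (e.g. RUN python setup.py build_ext --inplace)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("maptile.pyx", language_level=3))
```

After building, `from maptile import tile_x` works exactly like importing a plain Python function, which is what makes per-function adoption possible.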

Slide 41

Slide 41 text

• Loading from the CPU cache is much faster than from RAM • Optimize CPU cache usage by using numpy vector calculation instead of loop processing • E.g.: calculating the distance to each coupon from the requested location information Countermeasure (3): numpy 3-2. Slow CPU bound processing Latency Comparison Numbers (~2012) ---------------------------------- L1 cache reference 0.5 ns L2 cache reference 7 ns 14x L1 cache Main memory reference 100 ns 20x L2 cache, 200x L1 cache — Latency Numbers Every Programmer Should Know

Slide 42

Slide 42 text

Countermeasure (3): numpy 3-2. Slow CPU bound processing We want to calculate online, for each request, the distance from the requested location to each coupon. Turn the coupon coordinates into vectors and use vector calculation.
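A vectorized sketch of that distance calculation (the talk's exact formula is not shown; the haversine function and the coordinates below are illustrative):

```python
import numpy as np

def haversine_km(lat, lon, lats, lons):
    """Distance (km) from one request point to every coupon at once."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(lats), np.radians(lons)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Coupon coordinates held as contiguous arrays instead of a list of
# objects, so one call computes all distances without a Python loop.
coupon_lats = np.array([35.6762, 34.6937])   # Tokyo, Osaka (illustrative)
coupon_lons = np.array([139.6503, 135.5023])
dists = haversine_km(35.6762, 139.6503, coupon_lats, coupon_lons)
```

The contiguous arrays keep the working set in CPU cache and let numpy's C loops do the arithmetic, which is the cache-friendliness the latency numbers above are pointing at.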

Slide 43

Slide 43 text

3-3. Slow logging: Slow MultiThread ✤ Symptoms • CPU is used by threads other than FastAPI's • Check with Datadog • GIL values are low in py-spy top ✤ Causes • A library called loguru was writing logs in a separate thread, and that was taking a long time • Python can handle threads, but not efficiently because of the GIL

Slide 44

Slide 44 text

3-3. Slow logging: Slow MultiThread GIL / Global Interpreter Lock "In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once." — Global Interpreter Lock • Even if you set up threads, only one thread runs at a time within a process • In Python, parallel processing is handled by multi-processing; combining that with concurrent processing via asyncio can solve the problem
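The multi-processing-plus-asyncio combination can be sketched as follows (generic example, not the talk's code; `cpu_heavy` is an illustrative stand-in for GIL-bound work):

```python
import asyncio
import math
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    # Pure-Python arithmetic like this holds the GIL, so threads
    # would not run it in parallel; separate processes can.
    return sum(math.sqrt(i) for i in range(n))

async def run_parallel(workloads):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Offload CPU-bound work to worker processes while the
        # event loop stays free to serve other requests.
        futures = [loop.run_in_executor(pool, cpu_heavy, n) for n in workloads]
        return await asyncio.gather(*futures)

results = asyncio.run(run_parallel([10_000, 20_000]))
```

`run_in_executor` is how an async FastAPI route can hand work to a process pool without blocking the loop.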

Slide 45

Slide 45 text

3-3. Slow logging: Slow MultiThread Serialization and emit of log information are taking a long time; it seems to be retrieving from the Queue to deserialize. If _enqueue is true (default), a thread and a SimpleQueue are created for log output. multiprocessing.SimpleQueue is actually a pipe for communication between processes, and data transfer to and from the WriterThread seems to be slow.

Slide 46

Slide 46 text

3-3. Slow logging: Slow MultiThread Countermeasure: Suppress logging and change libraries • Switched to fastlogging • I would like to handle relatively large logs, but this is still unresolved.

Slide 47

Slide 47 text

• There are various processing systems (Python implementations) other than the CPython we usually use • Implementations that compile to a lower-level language, implementations with a JIT compiler, etc. • If the CPU is the bottleneck, an overall performance improvement can be expected • I tried two, PyPy and cinder, and selected PyPy this time 3-4. Improving overall performance through changes in the processing system

Slide 48

Slide 48 text

cinder "Cinder is Instagram's internal performance-oriented production version of CPython 3.8. It contains a number of performance optimizations, including bytecode inline caching, eager evaluation of coroutines, …" — cinder • Performance-improved CPython from Facebook • We saw a performance improvement of about +10% in our environment • It is highly compatible with CPython and most of its libraries • However, you need to build it yourself, and I gave up on it because the GitHub repository is not very active and there was no documentation, so it was difficult to manage 3-4. Improving overall performance through changes in the processing system

Slide 49

Slide 49 text

PyPy "A fast, compliant alternative implementation of Python. On average, PyPy is 4.2 times faster(!) than CPython" — PyPy • JIT compiler/incminimark GC • It is also quite compatible with CPython • There have been no compatibility-related problems so far • The performance benefits were so great that we now use it in production • We confirmed a performance improvement of nearly 40% 3-4. Improving overall performance through changes in the processing system

Slide 50

Slide 50 text

PyPy - This is not a silver bullet ✤ Challenges faced in replacement and operation • The latest version is 3.7. • Some libraries are not available. • OOM death due to omission of GC Option specification • Memory leak when combined with FastAPI? 3-4. Improving overall performance through changes in the processing system

Slide 51

Slide 51 text

Problems in PyPy (1): The latest version is 3.7 ✤ Problem: 3.7 is the latest version as of 2021 • 3.8 is in beta ✤ Solution • Since I was developing based on version 3.8, I downgraded some parts • Walrus operator := • Positional-only arguments def huga(hoge, /, …) • I have not faced any real problems 3-4. Improving overall performance through changes in the processing system

Slide 52

Slide 52 text

Problems in PyPy (2): Some libraries cannot be used ✤ Problem: PEP 517 Libraries that involve building external sources are almost always a problem. • E.g. • orjson … high-speed JSON serializer from rust 😭 • dd-trace-py … for measuring DataDogAPM 😭 • fastapi-profiler … the profiler that has been working wonders 😭 • black(caused by typed-ast) … Linter/Formatter, used through pysen 😭 ✤ Solution • Use separate processing systems for development and deployment environments 3-4. Improving overall performance through changes in the processing system

Slide 53

Slide 53 text

To prevent bugs, such as the program working in CPython but not in PyPy, we implement e2e tests using production containers during CD For development: CPython3.7 We also implemented UT on CI For production: PyPy3.7 3-4. Improving overall performance through changes in the processing system Problems in PyPy (2): Some libraries cannot be used

Slide 54

Slide 54 text

✤ Problem: Memory increases to the maximum and the process dies with OOMKiller • PyPy's GC is incminimark • GC options must be specified ✤ Solution • Output GC information with PYPY_GC_DEBUG=2 • Restrict the heap limit with PYPY_GC_MAX 3-4. Improving overall performance through changes in the processing system Problems in PyPy (3): OOM death due to omission of GC Option specification

Slide 55

Slide 55 text

Problems in PyPy (3): OOM death due to omission of GC Option specification Ensure that MaxWorkers × PYPY_GC_MAX fits within the memory limit. At the very least, specify PYPY_GC_MAX. 3-4. Improving overall performance through changes in the processing system
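A sketch of the container startup this implies (the heap cap, worker count, and module path are illustrative, not the talk's actual values):

```shell
# Cap PyPy's heap per process so that workers x PYPY_GC_MAX stays
# under the pod's memory limit (e.g. 4 workers x 800MB < limit).
export PYPY_GC_MAX=800MB
# Emit extra GC information while investigating (as used in the talk);
# too slow for steady-state production use.
export PYPY_GC_DEBUG=2

gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
```

Without PYPY_GC_MAX, incminimark grows the heap toward available memory and the OOMKiller fires before the GC feels any pressure.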

Slide 56

Slide 56 text

Problems in PyPy (4): Memory leak when combined with FastAPI? ✤ Problem: Memory increases to the maximum and the process dies with OOMKiller • Even on echo servers, memory increases monotonically under load • I had not set KeepAlive on nginx... ✤ Solution • Resolved by connecting with HTTP/1.1 (KeepAlive by default) • Also set timeout-keep-alive on uvicorn • I have not dug deep into this, but it seems to be a connection problem, so it may be due to uvicorn rather than FastAPI 3-4. Improving overall performance through changes in the processing system
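A hypothetical nginx snippet illustrating the fix (upstream name, port, and values are made up; nginx defaults to HTTP/1.0 with `Connection: close` toward upstreams unless told otherwise):

```nginx
upstream coupon_service {
    server 127.0.0.1:8000;
    keepalive 32;                 # pool of idle upstream connections
}
server {
    location / {
        proxy_pass http://coupon_service;
        proxy_http_version 1.1;   # required for upstream keepalive
        proxy_set_header Connection "";
    }
}
```

On the uvicorn side, `--timeout-keep-alive` bounds how long idle connections are held open, so the two settings together keep connection churn and leftover state in check.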

Slide 57

Slide 57 text

3. Problems Faced Performance improvement per instance after applying all countermeasures: [Throughput] 50 rps → 250 rps (500%⬆) [95%tile Latency] 300 ms → 30 ms (1000%⬆)

Slide 58

Slide 58 text

1. Introduction 2. Load Testing and Profiling 3. Problems Faced and Countermeasures 4. Summary Agenda

Slide 59

Slide 59 text

Summary ✤ Load Testing and Profiling • Determine targets → Measure → Resolve bottlenecks • Useful tools: locust, datadog, fastapi-profiler, py-spy ✤ Problems and solutions • I/O bound: asyncio, cache • CPU bound: cache, Cython, numpy • Change of processing system: PyPy

Slide 60

Slide 60 text

Thank You for Your Kind Attention! twitter: @martin_lover_se