Benchmarking and Performance Improvement of Proxy Server

Slide 1

Slide 1 text

Confidential - Do Not Share
Performance Improvement of Proxy Server
Kazushi Kitaya, Summer Internship @ SRE
Sep. 4, 2019

Slide 2

Slide 2 text

Introduction
- Kazushi Kitaya
- Junior-year CS student at the University of Tokyo
- Loves playing poker
- https://github.com/kkty

Slide 3

Slide 3 text

Goal
- Improve the performance of a proxy server named “Chocon”
- https://github.com/kazeburo/chocon

Slide 4

Slide 4 text

Background
- The RTT between Hokkaido and Tokyo is about 20 ms
- This cannot be ignored, especially for TCP handshakes, TLS handshakes, etc.
[Figure label: 20 ms RTT]
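To make the cost of the handshakes concrete, here is a rough back-of-the-envelope sketch. It assumes one round trip for the TCP handshake and two for a full TLS 1.2 handshake, and it counts network round trips only (no server processing or transfer time). The function name `firstResponseMs` is made up for this illustration.

```go
package main

import "fmt"

// rttMs is the round-trip time quoted on the slide (about 20 ms).
const rttMs = 20

// firstResponseMs estimates the time until the first response arrives,
// counting only network round trips. handshakeRTTs is the number of
// round trips spent on connection setup before the request can be sent.
func firstResponseMs(handshakeRTTs int) int {
	const requestRTTs = 1 // the HTTP request and its response
	return (handshakeRTTs + requestRTTs) * rttMs
}

func main() {
	// Assumed round-trip counts: TCP handshake = 1, full TLS 1.2 handshake = 2.
	fmt.Println("reused (keep-alive) connection:", firstResponseMs(0), "ms")
	fmt.Println("new HTTP connection:", firstResponseMs(1), "ms")
	fmt.Println("new HTTPS connection (TLS 1.2):", firstResponseMs(1+2), "ms")
}
```

Under these assumptions a request on a fresh HTTPS connection waits about four round trips, while a request on an already established connection waits about one.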

Slide 5

Slide 5 text

Background
- Chocon is a proxy server written in Go
- The idea is to keep connections alive (“keep-alive”) between the servers in Hokkaido and the servers in Tokyo, though it has some other use cases
- This is more effective than keeping connections alive between each server in Tokyo and each server in Hokkaido (the number of connections would be O(N*N))
[Diagram: Server (Tokyo) nodes, Chocon instances, and a Server (Hokkaido) node]
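The keep-alive idea can be observed with Go's standard library alone. The sketch below is not Chocon's code: it starts a local test server, sends a few sequential requests through one client, and uses `net/http/httptrace` to record whether each request ran on a reused connection. The helper name `connectionReuse` is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// connectionReuse sends n sequential GET requests to a local test server
// and reports, per request, whether an existing connection was reused.
func connectionReuse(n int) ([]bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	client := srv.Client()
	reused := make([]bool, 0, n)
	for i := 0; i < n; i++ {
		trace := &httptrace.ClientTrace{
			// GotConn fires once a connection has been obtained for the request.
			GotConn: func(info httptrace.GotConnInfo) {
				reused = append(reused, info.Reused)
			},
		}
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			return nil, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		// Drain and close the body so the connection can go back to the idle pool.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	reused, err := connectionReuse(3)
	if err != nil {
		panic(err)
	}
	fmt.Println("connection reused per request:", reused)
}
```

Only the first request should need a new connection, which is the handshake cost a keep-alive proxy is meant to avoid paying on every request.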

Slide 6

Slide 6 text

Background
- My project was to improve Chocon’s performance

Slide 7

Slide 7 text

Designing Benchmarks
- It is very important to design a good benchmark

Slide 8

Slide 8 text

Designing Benchmarks
- Need to simulate a high-latency network
- Want to run benchmarks on one machine
- If we needed to set up multiple servers, it would be hard to iterate

Slide 9

Slide 9 text

Designing Benchmarks
- Decided to use Docker containers
- Latency can be simulated by modifying their virtual network interfaces
- CPU and memory limits for Chocon can be configured
- Wrote a program that spins up containers/networks, runs benchmarks, and collects results
- https://github.com/kkty/chocon/tree/master/benchmark
- Many parameters are configurable, including HTTPS/HTTP, the CPU/memory limit for the Chocon container, network latency, and the body size of each request
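As an illustration of how such parameters could map onto a container launch, here is a minimal sketch. It is not taken from the benchmark program linked above: the `BenchConfig` type, its field names, the network name, and the image name are all hypothetical. The only real pieces are the `docker run` options `--detach`, `--network`, `--cpus`, and `--memory`.

```go
package main

import (
	"fmt"
	"strings"
)

// BenchConfig is a hypothetical bundle of the parameters listed on the slide.
type BenchConfig struct {
	HTTPS     bool   // benchmark over HTTPS instead of HTTP
	CPUs      string // CPU limit for the Chocon container, e.g. "2" or "0.5"
	Memory    string // memory limit for the Chocon container, e.g. "512m"
	LatencyMs int    // simulated network latency in milliseconds
	BodyBytes int    // request body size in bytes
}

// dockerRunArgs builds "docker run" arguments that apply the CPU and
// memory limits to one container attached to the given network.
func dockerRunArgs(cfg BenchConfig, network, image string) []string {
	return []string{
		"run", "--detach",
		"--network", network,
		"--cpus", cfg.CPUs,
		"--memory", cfg.Memory,
		image,
	}
}

// dockerRunCommand renders the arguments as a single shell-style line.
func dockerRunCommand(cfg BenchConfig, network, image string) string {
	return "docker " + strings.Join(dockerRunArgs(cfg, network, image), " ")
}

func main() {
	cfg := BenchConfig{HTTPS: true, CPUs: "2", Memory: "512m", LatencyMs: 20, BodyBytes: 1024}
	// "bench-net" and "chocon:bench" are placeholder names.
	fmt.Println(dockerRunCommand(cfg, "bench-net", "chocon:bench"))
}
```

The sketch only prints the command. The remaining fields (HTTPS, latency, body size) would be consumed by the load generator and the network setup rather than by `docker run`.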

Slide 10

Slide 10 text

Designing Benchmarks
[Diagram: the Benchmark App creates the Load Generator, Chocon, and Echo Server containers and the network, modifies the veth interfaces (add latency), sets the resource limit, starts the benchmark, and collects the results]

Slide 11

Slide 11 text

Designing Benchmarks (Notes)
- Docker containers have very little overhead
- Processes inside containers are (almost) identical to normal processes
- Apache Bench is not performant
- “Hey”, a load generator written in Go, is used
- https://github.com/rakyll/hey
- “Wrk” was also an option
- Latency can be simulated by modifying virtual network interfaces
- A pair of virtual network interfaces is created for each container (one for the container and the other on the host)
- Apply “tc qdisc netem …” to them
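The slide elides the exact netem arguments, so the following is only one plausible form: a sketch that builds a `tc qdisc add dev <interface> root netem delay <N>ms` command for a given interface. The interface name `veth1a2b3c` is a placeholder, and the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// netemArgs returns the arguments for "tc" that attach a netem qdisc
// with a fixed delay to the given network interface.
func netemArgs(dev string, delayMs int) []string {
	return []string{
		"qdisc", "add", "dev", dev,
		"root", "netem", "delay", fmt.Sprintf("%dms", delayMs),
	}
}

// netemCommand renders the full command line for logging or dry runs.
func netemCommand(dev string, delayMs int) string {
	return "tc " + strings.Join(netemArgs(dev, delayMs), " ")
}

func main() {
	// "veth1a2b3c" stands in for the host-side interface of a container.
	fmt.Println(netemCommand("veth1a2b3c", 20))
	// Applying it for real needs root privileges, for example:
	//   exec.Command("tc", netemArgs("veth1a2b3c", 20)...).Run()
}
```

netem shapes packets leaving the interface it is attached to, so a delay on a single interface affects one direction only. How the delay is split across the veth pair to reach a target RTT is a choice the benchmark has to make.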

Slide 12

Slide 12 text

Thoughts on benchmark results
- I ran benchmarks and collected profiles
- https://golang.org/pkg/net/http/pprof/ was a great help for profiling
- I found it hard to vastly improve performance just by tweaking small pieces of code
- I decided to replace net/http with fasthttp
- It is an alternative to net/http
- It is (said to be) blazingly fast
- It is reasonably popular and is expected to be maintained properly
- https://github.com/valyala/fasthttp
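For readers unfamiliar with net/http/pprof, the sketch below shows the usual way to enable it: a blank import registers the `/debug/pprof/` handlers on `http.DefaultServeMux`. To stay self-contained, the example serves that mux on a throwaway local test server and fetches the index page. How Chocon itself exposed its profiles is not stated on the slide, and the helper name `pprofIndex` is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/ handlers on http.DefaultServeMux
	"strings"
)

// pprofIndex serves http.DefaultServeMux on a temporary local server and
// reports the HTTP status of the pprof index page and whether the page
// mentions the heap profile.
func pprofIndex() (status int, listsHeap bool, err error) {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/debug/pprof/")
	if err != nil {
		return 0, false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, false, err
	}
	return resp.StatusCode, strings.Contains(string(body), "heap"), nil
}

func main() {
	status, listsHeap, err := pprofIndex()
	if err != nil {
		panic(err)
	}
	fmt.Println("status:", status, "heap profile listed:", listsHeap)
}
```

In a long-running server the same endpoints can be read with `go tool pprof`, for example `go tool pprof http://localhost:6060/debug/pprof/heap`, where the address is whatever the server listens on.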

Slide 13

Slide 13 text

Re-implementation
- https://github.com/kkty/chocon/commit/8eaf11271c0136223e2db1687063ce1c68bd9d1c
- A large part of the existing code had to be rewritten to make the most of fasthttp
- We have to work with byte slices a lot for performance
- We should be careful about data races
- I wrote some tests to detect them, which led to finding a bug in fasthttp
- https://github.com/valyala/fasthttp/pull/645
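One reason byte slices call for care is aliasing: a slice is a view into a buffer, and if that buffer is recycled for the next request, any slice kept from the previous request silently changes. The sketch below shows the hazard in a single goroutine with a made-up shared buffer. It is not fasthttp code and not the bug referenced above. With concurrent requests the same mistake becomes a data race, which Go's race detector (`go test -race`) can flag.

```go
package main

import "fmt"

// reusableBuf stands in for a buffer that a server recycles between requests.
var reusableBuf = make([]byte, 0, 64)

// readInto overwrites the shared buffer with a new payload and returns
// a slice that points into that shared buffer.
func readInto(payload string) []byte {
	reusableBuf = append(reusableBuf[:0], payload...)
	return reusableBuf
}

// retainUnsafe keeps the returned slice across the next read, so its
// contents are overwritten by the second payload.
func retainUnsafe() string {
	first := readInto("GET /a")
	readInto("GET /b")
	return string(first)
}

// retainSafe copies the bytes before the buffer is reused.
func retainSafe() string {
	first := append([]byte(nil), readInto("GET /a")...)
	readInto("GET /b")
	return string(first)
}

func main() {
	fmt.Println("without copy:", retainUnsafe())
	fmt.Println("with copy:   ", retainSafe())
}
```

The copy costs an allocation, so the usual approach is to copy only the values that must outlive the request.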

Slide 14

Slide 14 text

Results
- Benchmarks with a lot of parameters...

Slide 15

Slide 15 text

Results
- Summary: achieved 1.2x-2x throughput (depending on CPU/memory limit, latency, etc.)
- It performs especially well with multiple CPUs

Slide 16

Slide 16 text

Results
- The number of heap memory allocations was reduced by 80%

Slide 17

Slide 17 text

Results (Notes)
- Fasthttp uses a worker-pool model
- Unlike net/http, it does not create a new goroutine each time a request is accepted
- String<->[]byte conversions are not free
- They involve memory allocation and copying
- The functions of fasthttp accept/return []byte instead of string
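The cost of string<->[]byte conversions can be checked with the standard library. The sketch below uses `testing.AllocsPerRun` to count heap allocations for each direction. It assumes a 1 KiB payload and stores the results in package-level variables so that the compiler cannot keep them on the stack. Small or non-escaping conversions may be optimized, so this is a rough measurement rather than a rule.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Package-level sinks force the converted values to escape to the heap.
var (
	sinkBytes  []byte
	sinkString string
)

// conversionAllocs reports the average number of heap allocations per
// string -> []byte conversion and per []byte -> string conversion.
func conversionAllocs() (toBytes, toString float64) {
	s := strings.Repeat("x", 1024) // assumed payload size: 1 KiB
	b := []byte(s)

	toBytes = testing.AllocsPerRun(100, func() {
		sinkBytes = []byte(s) // allocates a new buffer and copies s into it
	})
	toString = testing.AllocsPerRun(100, func() {
		sinkString = string(b) // allocates a new string and copies b into it
	})
	return toBytes, toString
}

func main() {
	toBytes, toString := conversionAllocs()
	fmt.Println("allocations per string -> []byte conversion:", toBytes)
	fmt.Println("allocations per []byte -> string conversion:", toString)
}
```

An API that accepts and returns []byte lets a proxy pass data through without paying for these conversions on every request.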

Slide 18

Slide 18 text

Summary
- Carefully designed a benchmark for Chocon
- Reduced the number of heap memory allocations by replacing net/http with fasthttp and applying other optimizations
- Raised Chocon’s throughput to 1.2x-2x

Slide 19

Slide 19 text

Thank you