Benchmarking and Performance Improvement of Proxy Server

Confidential - Do Not Share
Performance Improvement of Proxy Server
Kazushi Kitaya, Summer Internship @ SRE
Sep. 4, 2019

Introduction
- Kazushi Kitaya
- Junior-year CS student at the University of Tokyo
- Loves playing poker
- https://github.com/kkty

Goal
- Performance improvement of a proxy server named "Chocon"
- https://github.com/kazeburo/chocon

Background
- The RTT between Hokkaido and Tokyo is about 20ms
- This cannot be ignored, especially for TCP handshakes, TLS handshakes, etc.
(Figure: 20ms RTT)

Background
- Chocon is a proxy server written in Go
- The idea is to keep connections alive between the servers in Hokkaido and the servers in Tokyo, though it has some other use cases
- It is more efficient than having keep-alive connections between each server in Tokyo and each server in Hokkaido (the number of connections would be O(N*N))
(Figure: Server (Tokyo) nodes, Chocon instances, and Server (Hokkaido))

Background
- My project was to improve Chocon's performance

Designing Benchmarks
- It is very important to design a good benchmark

Designing Benchmarks
- Need to simulate a high-latency network
- Want to run benchmarks on one machine
  - If we had to set up multiple servers, it would be hard to iterate

Designing Benchmarks
- Decided to use Docker containers
  - Latency can be simulated by modifying their virtual network interfaces
  - CPU and memory limits for Chocon can be configured
- Wrote a program that spins up containers/networks, runs benchmarks, and collects results
  - https://github.com/kkty/chocon/tree/master/benchmark
  - Many parameters are configurable, including HTTP/HTTPS, the CPU/memory limits for the Chocon container, network latency, and the body size of each request

Designing Benchmarks
(Figure: the benchmark app creates the load generator, Chocon, and echo server containers and their network, modifies the veth interfaces to add latency, sets resource limits, starts the benchmark, and collects the results.)

Designing Benchmarks (Notes)
- Docker containers have very little overhead
  - Processes inside containers are (almost) identical to normal processes
- Apache Bench is not fast enough
  - "Hey", a load generator written in Go, is used instead
  - https://github.com/rakyll/hey
  - "Wrk" was also an option
- Latency can be simulated by modifying virtual network interfaces
  - A pair of virtual network interfaces is created for each container (one for the container and the other on the host)
  - Apply "tc qdisc netem …" to them

Thoughts on benchmark results
- I ran benchmarks and collected profiles
  - https://golang.org/pkg/net/http/pprof/ was a great help for profiling
- I found it hard to improve performance substantially just by tweaking small pieces of code
- I decided to replace net/http with fasthttp
  - It is an alternative to net/http
  - It is (said to be) blazingly fast
  - It is reasonably popular and is expected to remain well maintained
  - https://github.com/valyala/fasthttp
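Enabling net/http/pprof takes a single blank import, which registers the profiling handlers on `http.DefaultServeMux`. The sketch below checks this in-process with httptest instead of opening a port; the helper name `pprofIndexStatus` is made up, and the listen address in the comments is only a common convention, not something taken from Chocon.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
)

// pprofIndexStatus requests the pprof index page from the default mux
// in-process and returns the HTTP status code.
func pprofIndexStatus() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("pprof index status:", pprofIndexStatus())
	// In a real server one would typically expose the default mux, e.g.
	//   go http.ListenAndServe("localhost:6060", nil)
	// and then collect a CPU profile with
	//   go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
}
```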

Re-implementation
- https://github.com/kkty/chocon/commit/8eaf11271c0136223e2db1687063ce1c68bd9d1c
- A large part of the existing code had to be rewritten to make the most of fasthttp
  - We have to work with byte slices a lot for performance
  - We should be careful about data races
    - I wrote some tests to detect them, which led to finding a bug in fasthttp
    - https://github.com/valyala/fasthttp/pull/645

Results
- Benchmarks with a lot of parameters...

Results
- Summary: achieved 1.2x-2x throughput (depending on CPU/memory limits, latency, etc.)
- It performs especially well with multiple CPUs

Results
- The number of heap memory allocations was reduced by 80%

Results (Notes)
- Fasthttp uses a worker-pool model
  - It does not create a new goroutine each time a connection is accepted, as net/http does
- String<->[]byte conversions are not free
  - They involve memory allocation and copying
  - The functions of fasthttp accept/return []byte instead of string

Summary
- Carefully designed a benchmark for Chocon
- Reduced the number of heap memory allocations by replacing net/http with fasthttp and other optimizations
- Succeeded in improving Chocon's throughput by 1.2x-2x

Thank you
