Benchmarking and Performance Improvement of Proxy Server

kkty

September 20, 2019
Transcript

  1. Performance Improvement of Proxy Server

    Kazushi Kitaya, Summer Internship @ SRE, Sep. 4, 2019. Confidential - Do Not Share
  2. Introduction

    - Kazushi Kitaya
    - Junior-year CS student at the University of Tokyo
    - Loves playing poker
    - https://github.com/kkty
  3. Goal

    - Performance improvement of a proxy server named “Chocon”
    - https://github.com/kazeburo/chocon
  4. Background

    - The RTT between Hokkaido and Tokyo is about 20 ms
    - It cannot be ignored, especially for TCP handshakes, TLS handshakes, etc.
  5. Background

    - Chocon is a proxy server written in Go
    - The idea is to keep connections alive between the servers in Hokkaido and the servers in Tokyo, though it has some other use cases
    - It is more effective than having keep-alive connections between each server in Tokyo and each server in Hokkaido (the number of connections would be O(N*N))
    - (Diagram: servers in Tokyo, Chocon instances, and a server in Hokkaido)
  6. Designing Benchmarks

    - It is very important to design a good benchmark
  7. Designing Benchmarks

    - Need to simulate a high-latency network
    - Want to run benchmarks on one machine
      - If we needed to set up multiple servers, it would be hard to iterate
  8. Designing Benchmarks

    - Decided to use Docker containers
      - Latency can be simulated by modifying their virtual network interfaces
      - CPU and memory limits for Chocon can be configured
    - Wrote a program that spins up containers/networks, runs benchmarks, and collects results
      - https://github.com/kkty/chocon/tree/master/benchmark
      - Many parameters are configurable, including HTTPS/HTTP, the CPU/memory limits for the Chocon container, network latency, and the body size of each request
  9. Designing Benchmarks

    (Diagram: the Benchmark App creates the Load Generator, Chocon, and Echo Server containers and the network, modifies the veth interfaces to add latency, sets the resource limit, starts the benchmark, and collects the results.)
  10. Designing Benchmarks (Notes)

    - Docker containers have very little overhead
      - Processes inside containers are (almost) identical to normal processes
    - Apache Bench is not performant
      - “Hey”, a load generator written in Go, is used
      - https://github.com/rakyll/hey
      - “Wrk” was also an option
    - Latency can be simulated by modifying virtual network interfaces
      - A pair of virtual network interfaces is created for each container (one for the container and the other on the host)
      - Apply “tc qdisc netem …” to them
  11. Thoughts on benchmark results

    - I ran benchmarks and collected profiles
      - https://golang.org/pkg/net/http/pprof/ was a great help for profiling
    - I found it hard to vastly improve performance just by tweaking small pieces of code
    - I decided to replace net/http with fasthttp
      - It is an alternative to net/http
      - It is (said to be) blazingly fast
      - It is reasonably popular and is expected to be maintained properly
      - https://github.com/valyala/fasthttp
  12. Re-implementation

    - https://github.com/kkty/chocon/commit/8eaf11271c0136223e2db1687063ce1c68bd9d1c
    - A large part of the existing code had to be rewritten to make the most of fasthttp
      - We have to work with byte slices a lot for performance
      - We should be careful about data races
    - I wrote some tests to detect them, which led to finding a bug in fasthttp
      - https://github.com/valyala/fasthttp/pull/645
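
One general hazard when working with reused byte slices is aliasing: a slice kept from an earlier request still points at a buffer that is later overwritten. The single-goroutine sketch below uses only the standard library to show the effect; it is an assumed illustration of the general pattern, not the bug fixed in the linked pull request and not fasthttp code. The function name `reuseDemo` is hypothetical. When the overwrite happens on another goroutine, the same pattern becomes a data race, which tests run with `go test -race` can help detect.

```go
package main

import "fmt"

// reuseDemo writes two "requests" into one reused buffer and reports what a
// retained slice and a defensive copy contain afterwards.
func reuseDemo() (kept, copied string) {
	buf := make([]byte, 0, 16) // stands in for a buffer reused across requests

	buf = append(buf[:0], "first"...)
	alias := buf                        // shares buf's backing array
	safe := append([]byte(nil), buf...) // independent copy

	buf = append(buf[:0], "other"...) // the next request overwrites the buffer

	return string(alias), string(safe)
}

func main() {
	kept, copied := reuseDemo()
	fmt.Println("retained slice now reads:", kept)
	fmt.Println("copy still reads:        ", copied)
}
```

The copy costs an allocation, so the usual trade-off is to copy only the data that must outlive the request.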
  13. Results

    - Summary: achieved 1.2x-2x throughput (depending on CPU/memory limits, latency, etc.)
    - It performs especially well with multiple CPUs
  14. Results

    - The number of heap memory allocations has been reduced by 80%
  15. Results (Notes)

    - Fasthttp uses a worker-pool model
      - Unlike net/http, it does not create a goroutine each time a request is accepted
    - String<->[]byte conversions are not free
      - They involve memory allocation and copying
      - The functions of fasthttp accept/return []byte instead of string
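
The cost of string<->[]byte conversions can be observed with `testing.AllocsPerRun` from the standard library. In this sketch the results are stored in package-level variables and a 1 KiB payload is used so that the compiler cannot keep the converted data on the stack; the payload size and the names `conversionAllocs`, `sinkBytes`, and `sinkString` are choices made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Package-level sinks force the converted values to escape to the heap.
var (
	sinkBytes  []byte
	sinkString string
)

// conversionAllocs reports the average number of heap allocations per
// string->[]byte and []byte->string conversion of a 1 KiB payload.
func conversionAllocs() (toBytes, toString float64) {
	s := strings.Repeat("x", 1024)
	b := []byte(s)

	toBytes = testing.AllocsPerRun(100, func() { sinkBytes = []byte(s) })
	toString = testing.AllocsPerRun(100, func() { sinkString = string(b) })
	return toBytes, toString
}

func main() {
	toBytes, toString := conversionAllocs()
	fmt.Println("allocations per string->[]byte conversion:", toBytes)
	fmt.Println("allocations per []byte->string conversion:", toString)
}
```

Each conversion copies the payload into newly allocated memory, which is why an API that passes []byte through unchanged can avoid allocations on a hot path.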
  16. Summary

    - Carefully designed a benchmark for Chocon
    - Reduced the number of heap memory allocations by replacing net/http with fasthttp and through other optimizations
    - Succeeded in improving Chocon’s throughput by 1.2x-2x