Technology stack • Ruby on Rails for both web frontend and backend apps • Python for ML apps • Go for backend app • Rust, Swift, Java, etc. for internal apps
Operational problems • Decrease in system reliability • Hard to troubleshoot and debug ‣ Increase in time to detect root causes of incidents ‣ Capacity planning
Go, Python, Rust, Java, Swift apps? • Limitations of the library model approach ‣ More for product development ‣ Controlling library versions is hard • Planning to develop our own proxy and combine it with consul-template
Goals • Control resiliency settings by Ops ‣ Centrally managed ‣ Review flow • All metrics should go into Prometheus • Low operational cost ‣ Fewer components, use of managed services
Drop statsd-relay • Adding tags to metrics with DogStatsD format • Fewer components are preferable ‣ Send PRs to Envoy ‣ dog_statsd sink and fixed tag configuration are available
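To illustrate what "adding tags with DogStatsD format" means on the wire, here is a minimal sketch of how a tagged metric line is laid out (`name:value|type|#tag:value,...`). The function name, the metric name, and the fixed tag values are hypothetical and only serve the example; this is not Envoy's sink implementation.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Render one metric as a DogStatsD line: name:value|type|#k:v,k:v"""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are appended after "|#" as comma-separated key:value pairs.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


# Hypothetical fixed tags that would be attached to every metric from one host.
FIXED_TAGS = {"service": "user-api", "role": "web"}

print(format_dogstatsd("envoy.cluster.upstream_rq_total", 1, "c", FIXED_TAGS))
# prints: envoy.cluster.upstream_rq_total:1|c|#role:web,service:user-api
```

Because the tags travel inside each metric line, a separate relay that rewrites metric names is no longer needed to carry that context.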
Envoy on EC2 • Build and distribute as an in-house deb package • Manage as a systemd service • Use hot-restarter.py ‣ Generate a starter script for each host role
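The per-role starter script could be generated along these lines. hot-restarter.py takes a start script as its argument and exports `RESTART_EPOCH` for each Envoy process it launches, so the script has to `exec` Envoy with `--restart-epoch`. The binary path, config directory, role names, and the choice of passing the role as `--service-cluster` are assumptions for illustration, not the setup described in the talk.

```python
# Template for the script that hot-restarter.py runs on every (re)start.
# Paths and the per-role config naming are hypothetical.
STARTER_TEMPLATE = """#!/bin/sh
# Launched by hot-restarter.py, which exports RESTART_EPOCH.
exec {envoy_bin} -c {config_dir}/{role}.yaml \\
  --restart-epoch "$RESTART_EPOCH" \\
  --service-cluster {role}
"""


def render_starter_script(role, envoy_bin="/usr/bin/envoy", config_dir="/etc/envoy"):
    """Return the text of a starter script for one host role."""
    return STARTER_TEMPLATE.format(role=role, envoy_bin=envoy_bin, config_dir=config_dir)


# Hypothetical host roles.
for role in ("web", "batch"):
    print(render_starter_script(role))
```

A systemd unit would then point `ExecStart` at hot-restarter.py with the generated script as its argument, so that a reload triggers a hot restart instead of dropping connections.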
wait-side-car • Sidecar Envoy containers need a few seconds to be up ‣ For background jobs • Wrapper command-line tool ‣ cookpad/wait-side-car https://github.com/cookpad/wait-side-car
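The idea behind such a wrapper can be sketched as: poll the sidecar until it reports ready, then replace the wrapper process with the real job command. This is an illustrative sketch only, not the implementation of cookpad/wait-side-car. The `SIDECAR_READY_URL` variable, the default port 9901, and the timeouts are assumptions; `/ready` is the readiness path of Envoy's admin interface.

```python
import http.client
import os
import sys
import time
import urllib.request


def sidecar_ready(url, timeout=1.0):
    """Return True if the readiness URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or non-2xx: not ready yet.
        return False


def wait_for_sidecar(probe, deadline_s=30.0, interval_s=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or deadline_s has elapsed."""
    deadline = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def main(argv):
    """Wait for the sidecar, then exec the wrapped command (argv[1:])."""
    if len(argv) < 2:
        sys.exit("usage: wrapper COMMAND [ARGS...]")
    # Hypothetical default; adjust to wherever the sidecar's admin port listens.
    url = os.environ.get("SIDECAR_READY_URL", "http://127.0.0.1:9901/ready")
    if not wait_for_sidecar(lambda: sidecar_ready(url)):
        sys.exit("sidecar did not become ready in time")
    # Replace this process so the job receives signals and exit codes directly.
    os.execvp(argv[1], argv[1:])
```

An entry-point script would call `main(sys.argv)`, so a background job could be started as `wrapper bundle exec rake some:task` and only begin once the sidecar accepts traffic.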
Resiliency • Eliminates temporary bursts of errors from backend services • Speed of reviewing settings and deployment • Fault isolation: no remarkable results yet
Observability • Decrease in time to detect root causes around service communication issues • Visualization of how resiliency mechanisms are working • One of the sources of Service Level Indicators
Wrap up • Issues around service communications • Introducing a service mesh instead of the library approach • Key results: resiliency, observability, platform improvement