Building and operating service mesh at mid-size company

Building and operating service mesh at mid-size company Taiki Ono,
Cookpad Inc.

Agenda • Background • Problems • Introducing and operations •
Key results • Next challenges

Background

Cookpad • "Make everyday cooking fun!" • Originally started in
Japan in 1997 • Operate in over 23 languages, 68 countries

Scale • 200+ product developers • 100+ production services •
90M monthly average users

Organization structure

Technology stack • Ruby on Rails for both web frontend
and backend apps • Python for ML apps • Go for backend apps • Rust, Swift, Java, etc. for internal apps

Problems

Operational problems • Decrease in system reliability • Hard to
troubleshoot and debug ‣ Increased time to detect root causes of incidents ‣ Capacity planning

Solutions • Expeditor ‣ Ruby library inspired by Netflix's Hystrix
• aws-xray ‣ Ruby library for distributed tracing using AWS's X-Ray service https://github.com/cookpad
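
To make the Hystrix-style idea concrete, here is a minimal circuit-breaker sketch. It is written in Python for illustration and is not Expeditor's API; the class name, parameters, and thresholds are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Illustrative breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the unhealthy backend.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and resets the counter.
        self.failures = 0
        self.opened_at = None
        return result
```

Every application has to embed and configure such a breaker itself, which is the per-language cost that the later slides address.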

http://techlife.cookpad.com/entry/2017/09/06/115710

Go, Python, Rust, Java, Swift apps? • Limitation of library model approach ‣ More
for product development ‣ Controlling library versions is hard • Were planning to develop our own proxy, combined with consul-template

Service mesh to the rescue

At SREcon17 Americas: "Lyft's Envoy: Experiences Operating a Large
Service Mesh"

Replacing libraries with a proxy

control-plane

Introducing and operating service mesh

Timeline • Early 2017: making a plan • Late 2017: building
an MVP • Early 2018: generally available

Envoy • Publicly released in mid-2016 • Lightweight •
Graceful reloading • gRPC support https://github.com/envoyproxy/envoy

Plan: in-house • Early 2017: no Istio • We use
Amazon ECS • Not using full features of Envoy • Resiliency and observability

Goals • Control resiliency settings by Ops ‣ Centrally managed
‣ Review flow • All metrics should go into Prometheus • Low operation cost ‣ Fewer components, use of managed services

Configuration contents • Jsonnet • Route config ‣ Retry, timeouts
for paths, domains • Cluster config ‣ DNS name of internal ELB ‣ Circuit breaker settings https://github.com/cookpad/kumonos
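
As a rough illustration of what such centrally managed configuration contains, the sketch below builds one route entry and one cluster entry as plain dictionaries. It is in Python rather than Jsonnet, and it is not the kumonos schema: the key names are loosely modeled on Envoy's route and cluster settings, and the function names, service name, DNS name, and numbers are all hypothetical.

```python
import json


def build_route(prefix, cluster, timeout_s, num_retries, per_try_timeout_s):
    """Route entry: per-path timeout and retry settings (illustrative shape)."""
    return {
        "match": {"prefix": prefix},
        "route": {
            "cluster": cluster,
            "timeout": f"{timeout_s}s",
            "retry_policy": {
                "retry_on": "5xx,connect-failure",
                "num_retries": num_retries,
                "per_try_timeout": f"{per_try_timeout_s}s",
            },
        },
    }


def build_cluster(name, elb_dns_name, port, max_connections,
                  max_pending_requests, max_retries):
    """Cluster entry: internal ELB DNS name plus circuit breaker limits."""
    return {
        "name": name,
        "type": "STRICT_DNS",  # resolve the ELB name via DNS
        "address": {"host": elb_dns_name, "port": port},  # hypothetical shape
        "circuit_breakers": {
            "thresholds": [{
                "max_connections": max_connections,
                "max_pending_requests": max_pending_requests,
                "max_retries": max_retries,
            }]
        },
    }


if __name__ == "__main__":
    config = {
        "routes": [build_route("/recipes", "recipe", 3, 2, 1)],
        "clusters": [build_cluster("recipe", "recipe.internal.example", 80, 64, 32, 3)],
    }
    print(json.dumps(config, indent=2))
```

Keeping these values in one reviewed repository is what lets Ops change retries, timeouts, and breaker limits without touching application code.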

Drop statsd-relay • Adding tags to metrics with DogStatsD format
• Fewer components are preferable ‣ Sent PRs to Envoy ‣ dog_statsd sink and fixed tag configuration are available
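
For readers unfamiliar with the DogStatsD format, the sketch below shows how tags are appended to a StatsD metric line. The helper name, metric name, and tag keys are illustrative assumptions, not what Envoy emits verbatim.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram: name:value|type|#k:v,k:v"""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags follow "|#" as comma-separated key:value pairs (sorted for stable output).
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


if __name__ == "__main__":
    print(format_dogstatsd(
        "envoy.cluster.upstream_rq_total", 1, "c",
        {"service": "recipe", "cluster": "user"},
    ))
```

With tags carried in the metric itself, no intermediate relay is needed to attach them.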

gRPC infrastructure • Need L7 proxy for HTTP/2 traffic •
Let's extend control-plane

ServiceDiscoveryService API • lyft/discovery ‣ Reference implementation of SDS API
• Moved to cookpad/sds

The hard point of ECS • Copy current ECS service
• Wait then switch • Delete old one
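
The three steps above can be sketched as follows. This is a sketch under stated assumptions: `FakeEcs` is an in-memory stand-in rather than the AWS SDK, and every method name, the `switch_traffic` callback, and the service names are hypothetical. The slide does not say how each step was actually automated.

```python
import time


class FakeEcs:
    """In-memory stand-in for an ECS client (hypothetical, for illustration only)."""

    def __init__(self, services, polls_until_stable=2):
        self.services = dict(services)
        self._remaining = polls_until_stable

    def describe(self, name):
        return self.services[name]

    def create(self, name, definition):
        self.services[name] = definition

    def is_stable(self, name):
        # Pretend the new service becomes healthy after a few polls.
        self._remaining -= 1
        return self._remaining <= 0

    def delete(self, name):
        del self.services[name]


def replace_service(ecs, old_name, new_name, overrides, switch_traffic,
                    max_polls=30, poll_interval=1.0, sleep=time.sleep):
    # 1. Copy the current service definition, applying the changes we need.
    definition = dict(ecs.describe(old_name))
    definition.update(overrides)
    ecs.create(new_name, definition)

    # 2. Wait until the copy is stable, then switch traffic to it.
    for _ in range(max_polls):
        if ecs.is_stable(new_name):
            break
        sleep(poll_interval)
    else:
        ecs.delete(new_name)  # roll back the copy, keep the old service serving
        raise TimeoutError(f"{new_name} did not become stable")
    switch_traffic(new_name)

    # 3. Delete the old service once traffic has moved.
    ecs.delete(old_name)
```

The old service keeps serving until the copy is confirmed stable, so a failed copy can be discarded without user impact.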

Generally available

Operations

Dashboards • Prometheus • Grafana ‣ Per service ‣ Per
service-to-service ‣ Envoy instances • Vizceral ‣ promviz, promviz-front

Envoy on EC2 • Build and distribute as an in-house
deb package • Manage as a systemd service • Use hot-restarter.py ‣ Generate starter script for each host role

wait-side-car • Sidecar Envoy containers need a few seconds to
be up ‣ For background jobs • Wrapper command-line tool ‣ cookpad/wait-side-car https://github.com/cookpad/wait-side-car
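
The idea behind such a wrapper can be sketched as below: poll until the sidecar accepts connections, then run the real command. This is not the implementation of cookpad/wait-side-car, which may check readiness differently; the function names, the default port 9211, and the timeouts are assumptions.

```python
import socket
import subprocess
import sys
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def run_after_sidecar(command, host="127.0.0.1", port=9211, timeout=30.0):
    """Wait for the sidecar, then run the wrapped command and return its exit code."""
    if not wait_for_port(host, port, timeout):
        print(f"sidecar at {host}:{port} not ready after {timeout}s", file=sys.stderr)
        return 1
    return subprocess.call(command)


if __name__ == "__main__":
    # Example: python wait_wrapper.py bundle exec rake some:batch_job
    sys.exit(run_after_sidecar(sys.argv[1:]))
```

Short-lived background jobs benefit most, since they would otherwise issue their first requests before the proxy is listening.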

https://techlife.cookpad.com/entry/2018/04/02/140846

Key results

Resiliency • Eliminated temporary bursts of errors from backend services
• Faster review and deployment of settings • Fault isolation: no remarkable results yet

Observability • Decreased time to detect root causes of
service communication issues • Visualization of how resilience mechanisms are working • One of the sources of Service Level Indicators

Growth of platform • Improve application platform without application deployment
• Increase velocity of platform development team

Next challenges

Next challenges • v2 xDS migration / Istio • Chaos
engineering platform • Distributed tracing • Auth[z, n]

Wrap up

Wrap up • Issues around service communications • Introducing a service
mesh instead of the library approach • Key results: resiliency, observability, platform improvement

Q&A • Twitter: @taiki45 • These slides will be published later •
http://techlife.cookpad.com/
