Agenda
• Background
• Problems
• Introduction and operations
• Key results
• Next challenges
Slide 3
Slide 3 text
Background
Slide 4
Slide 4 text
Cookpad
• "Make everyday cooking fun!"
• Originally started in Japan in 1997
• Operate in over 23 languages, 68
countries
Slide 5
Slide 5 text
Scale
• 200+ product developers
• 100+ production services
• 90M monthly average users
Slide 6
Slide 6 text
Organization structure
Service Team
SRE team etc
Slide 7
Slide 7 text
Technology stack
• Ruby on Rails for both web frontend
and backend apps
• Python for ML apps
• Go for backend apps
• Rust, Swift, Java, etc. for internal
apps
Slide 9
Slide 9 text
Problems
Slide 10
Slide 10 text
Operational problems
• Decrease in system reliability
• Hard to troubleshoot and debug
‣ Increased time to detect root causes of
incidents
‣ Capacity planning
Slide 11
Slide 11 text
Solutions
• Expeditor
‣ Ruby library inspired by Netflix's
Hystrix
• aws-xray
‣ Ruby library for distributed tracing
using AWS's X-Ray service
https://github.com/cookpad
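To make the Hystrix-style idea concrete, here is a minimal circuit breaker sketch. It is written in Python for illustration only and is not Expeditor's API; the class name, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the backend.
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

Every application has to wrap its outbound calls this way, in its own language, which is the cost the library model carries.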
Go, Python, Rust, Java, Swift apps?
• Limitation of library model
approach
‣ More for product development
‣ Controlling library versions is hard
• Planned to develop our own proxy,
combined with consul-template
Slide 15
Slide 15 text
Service mesh to the rescue
Slide 16
Slide 16 text
at SRECON America 2017
"Lyft's Envoy: Experiences Operating a Large Service Mesh"
Slide 17
Slide 17 text
Replacing libraries with a proxy
Slide 18
Slide 18 text
control-plane
Slide 19
Slide 19 text
Introducing and
operating a service mesh
Slide 20
Slide 20 text
Timeline
• Early 2017: making plan
• Late 2017: building MVP
• Early 2018: generally available
Slide 21
Slide 21 text
Envoy
• Publicly released in mid-2016
• Lightweight
• Graceful reloading
• gRPC support
https://github.com/envoyproxy/envoy
Slide 22
Slide 22 text
Plan: in-house
• Early 2017: no Istio
• We use Amazon ECS
• Not using full features of Envoy
• Resiliency and observability parts
Slide 23
Slide 23 text
Goals
• Control resiliency settings by Ops
‣ Centrally managed
‣ Review flow
• All metrics should go into Prometheus
• Low operation cost
‣ Fewer components, use of managed services
Slide 25
Slide 25 text
Configuration contents
• Jsonnet
• Route config
‣ Retry, timeouts for
paths, domains
• Cluster config
‣ DNS name of internal ELB
‣ Circuit breaker settings
https://github.com/cookpad/kumonos
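As a rough sketch of the idea on this slide, the function below turns a small per-service definition into route and cluster sections. The talk uses Jsonnet and kumonos for this; Python stands in here, and every field name (`dependencies`, `elb_dns_name`, `timeout_ms`, and so on) and default value is hypothetical rather than the kumonos or Envoy schema.

```python
import json


def build_proxy_config(service):
    """Build route and cluster sections from a hypothetical service definition."""
    routes = []
    clusters = []
    for dep in service["dependencies"]:
        # Route config: retries and timeouts per upstream domain.
        routes.append({
            "domain": dep["name"],
            "cluster": dep["name"],
            "timeout_ms": dep.get("timeout_ms", 3000),
            "retry": {
                "retry_on": dep.get("retry_on", "5xx"),
                "num_retries": dep.get("num_retries", 1),
            },
        })
        # Cluster config: where the upstream lives and its circuit breaker limits.
        clusters.append({
            "name": dep["name"],
            "dns_name": dep["elb_dns_name"],
            "circuit_breaker": {
                "max_connections": dep.get("max_connections", 64),
                "max_pending_requests": dep.get("max_pending_requests", 128),
            },
        })
    return {"routes": routes, "clusters": clusters}


if __name__ == "__main__":
    definition = {
        "name": "recipe-web",  # illustrative service name
        "dependencies": [
            {"name": "user", "elb_dns_name": "user.internal.example", "num_retries": 2},
        ],
    }
    print(json.dumps(build_proxy_config(definition), indent=2))
```

Keeping the definitions in one repository is what allows the settings to be centrally managed and reviewed before they reach the proxies.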
Slide 26
Slide 26 text
Drop statsd-relay
• Adding tags to metrics
with DogStatsd format
• Fewer components are
preferable
‣ Sent PRs to Envoy
‣ dog_statsd sink and
fixed tag configuration
are now available
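For readers unfamiliar with the DogStatsD format, the sketch below shows how tags are appended to a plain StatsD line. The helper name and the metric and tag names in the usage line are illustrative assumptions; it does not reproduce Envoy's sink implementation.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram: name:value|type|#k:v,k:v."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags follow a "|#" marker as comma-separated key:value pairs.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


if __name__ == "__main__":
    # A counter increment tagged with hypothetical service and cluster names.
    print(format_dogstatsd("envoy.cluster.upstream_rq_total", 1, "c",
                           {"service": "recipe-web", "cluster": "user"}))
```

With the tags carried in the metric line itself, no relay is needed to attach them afterwards.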
Slide 27
Slide 27 text
gRPC infrastructure
• Need L7 proxy for
HTTP/2 traffic
• Let's extend
control-plane
Slide 28
Slide 28 text
ServiceDiscoveryService API
• lyft/discovery
‣ Reference
implementation of SDS
API
• Moved to cookpad/sds
Slide 29
Slide 29 text
The hard part of ECS
• Copy current ECS
service
• Wait then switch
• Delete old one
Slide 30
Slide 30 text
Generally available
Slide 31
Slide 31 text
Operations
Slide 32
Slide 32 text
Dashboards
• Prometheus
• Grafana
‣ Per service
‣ Per service-to-service
‣ Envoy instances
• Vizceral
‣ promviz, promviz-front
Slide 38
Slide 38 text
Envoy on EC2
• Build and distribute as an in-house
deb package
• Manage as a systemd service
• Use hot-restarter.py
‣ Generate starter script for each host
role
Slide 39
Slide 39 text
wait-side-car
• Sidecar Envoy containers need a few
seconds to be up
‣ For background jobs
• Wrapper command-line tool
‣ cookpad/wait-side-car
https://github.com/cookpad/wait-side-car
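The waiting step can be sketched as a small readiness loop. This is not the cookpad/wait-side-car implementation; the function name, the TCP-connect check, and the default timings are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the sidecar is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A wrapper built on this would call `wait_for_port` for the sidecar's listener and then replace itself with the real job command (for example via `os.execvp`), so a short-lived background job does not send its first request before the proxy is up.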
Resiliency
• Eliminates temporary bursts of errors
from backend services
• Speed of reviewing settings and
deployment
• Fault isolation: no remarkable
results yet
Slide 43
Slide 43 text
Observability
• Decreased time to detect root causes
of service communication issues
• Visualization of how resilience
mechanisms are working
• One of the sources of Service Level
Indicators
Slide 44
Slide 44 text
Continuous Growth of platform
• Improve application platform without application
deployment
• Increase velocity of platform development team
Slide 45
Slide 45 text
Next challenges
Slide 46
Slide 46 text
Next challenges
• v2 xDS migration
• More effective traffic control
• Chaos engineering platform
• Distributed tracing
• Auth[z, n]
Slide 47
Slide 47 text
Wrap up
Slide 48
Slide 48 text
Wrap up
• Issues around service communications
• Introducing a service mesh instead of
the library approach
• Key results: resiliency, observability,
platform improvement