Slide 1

Slide 1 text

Service Mesh at Cookpad Taiki Ono, Cookpad Inc.

Slide 2

Slide 2 text

Agenda • Background • Problems • Introducing and operations • Key results • Next challenges

Slide 3

Slide 3 text

Background

Slide 4

Slide 4 text

Cookpad • "Make everyday cooking fun!" • Originally started in Japan in 1997 • Operate in over 23 languages, 68 countries

Slide 5

Slide 5 text

Scale • 200+ product developers • 100+ production services • 90M monthly average users

Slide 6

Slide 6 text

Organization structure Service Team SRE team etc

Slide 7

Slide 7 text

Technology stack • Ruby on Rails for both web frontend and backend apps • Python for ML apps • Go for backend apps • Rust, Swift, Java, etc. for internal apps

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Problems

Slide 10

Slide 10 text

Operational problems • Decrease in system reliability • Hard to troubleshoot and debug ‣ Increase in time to detect root causes of incidents ‣ Capacity planning

Slide 11

Slide 11 text

Solutions • Expeditor ‣ Ruby library inspired by Netflix's Hystrix • aws-xray ‣ Ruby library for distributed tracing using AWS's X-Ray service https://github.com/cookpad
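
To illustrate the Hystrix-style pattern that a library like Expeditor brings into each application, here is a minimal circuit breaker sketch in Python. It is not Expeditor's API: the class name, the thresholds, and the injectable clock are assumptions made only for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of sending load to a struggling backend.
                raise CircuitOpenError("circuit is open")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Every application has to embed and configure logic like this itself, which is the cost of the library approach discussed on the following slides.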

Slide 12

Slide 12 text

http://techlife.cookpad.com/entry/2017/09/06/115710

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Go, Python, Rust, Java, Swift apps? • Limitation of the library model approach ‣ More for product development ‣ Controlling library versions is hard • Planning to develop our own proxy combined with consul-template

Slide 15

Slide 15 text

Service mesh to the rescue

Slide 16

Slide 16 text

At SREcon17 Americas: "Lyft's Envoy: Experiences Operating a Large Service Mesh"

Slide 17

Slide 17 text

Replacing libraries with a proxy

Slide 18

Slide 18 text

control-plane

Slide 19

Slide 19 text

Introducing and operating service mesh

Slide 20

Slide 20 text

Timeline • Early 2017: making plan • Late 2017: building MVP • Early 2018: generally available

Slide 21

Slide 21 text

Envoy • Publicly released in mid-2016 • Lightweight • Graceful reloading • gRPC support https://github.com/envoyproxy/envoy

Slide 22

Slide 22 text

Plan: in-house • Early 2017: no Istio • We use Amazon ECS • Not using full features of Envoy • Resiliency and observability parts

Slide 23

Slide 23 text

Goals • Control resiliency settings by Ops ‣ Centrally managed ‣ Review flow • All metrics should go into Prometheus • Low operation cost ‣ Fewer components, use of managed services

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Configuration contents • Jsonnet • Route config ‣ Retry, timeouts for paths, domains • Cluster config ‣ DNS name of internal ELB ‣ Circuit breaker settings https://github.com/cookpad/kumonos
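
The split between route config and cluster config can be sketched as follows. This is an illustration in Python rather than Jsonnet, and every key name, the `build_config` helper, and the example values are hypothetical; they are not the actual kumonos or Envoy schema.

```python
def build_config(service):
    """Split one service definition into a route part and a cluster part.

    All key names are illustrative assumptions, not a real schema.
    """
    name = service["name"]
    route_config = {
        "domains": [name],
        "routes": [
            {
                "prefix": r["prefix"],
                "cluster": name,
                # Per-path timeout and retry policy live on the route side.
                "timeout_ms": r["timeout_ms"],
                "retry": {
                    "retry_on": r["retry_on"],
                    "num_retries": r["num_retries"],
                },
            }
            for r in service["routes"]
        ],
    }
    cluster_config = {
        "name": name,
        # The upstream is addressed by the DNS name of an internal ELB.
        "hosts": [service["elb_dns_name"]],
        "circuit_breaker": dict(service["circuit_breaker"]),
    }
    return {"route_config": route_config, "cluster_config": cluster_config}


# Hypothetical service definition used only for this example.
example = {
    "name": "user",
    "elb_dns_name": "user.internal.example.com:80",
    "routes": [
        {"prefix": "/", "timeout_ms": 3000, "retry_on": "5xx", "num_retries": 3}
    ],
    "circuit_breaker": {"max_connections": 64, "max_pending_requests": 128},
}
```

Keeping such definitions in one reviewed repository is what lets Ops manage resiliency settings centrally, as stated in the goals.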

Slide 26

Slide 26 text

Drop statsd-relay • Adding tags to metrics with DogStatsD format • Fewer components are preferable ‣ Sent PRs to Envoy ‣ dog_statsd sink and fixed tag configuration are available
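
For reference, the DogStatsD format extends a plain StatsD line with a trailing `|#key:value,...` tag section. The sketch below formats such a line; the `format_dogstatsd` helper, the metric name, and the tags are made up for the example and do not reflect the actual metrics Envoy emits.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one DogStatsD datagram: name:value|type|#tag:value,..."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are sorted only to make the output deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line
```

With tags carried in the datagram itself, a separate relay that rewrites metric names is no longer needed to attach dimensions such as the source service.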

Slide 27

Slide 27 text

gRPC infrastructure • Need L7 proxy for HTTP/2 traffic • Let's extend control-plane

Slide 28

Slide 28 text

Service Discovery Service (SDS) API • lyft/discovery ‣ Reference implementation of the SDS API • Moved to cookpad/sds
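
As a rough sketch of what an SDS server does, the registry below tracks host registrations and answers with the live hosts for a service. The response shape (a `hosts` list of `ip_address`, `port`, and `tags`) follows the Envoy v1 SDS API as I understand it; the `Registry` class, its TTL-based expiry, and the method names are assumptions for illustration, not the cookpad/sds or lyft/discovery implementation.

```python
import time


class Registry:
    """Hypothetical in-memory host registry with TTL-based expiry."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._last_seen = {}  # (service, ip, port) -> time of last registration

    def register(self, service, ip, port):
        # Hosts are expected to re-register periodically as a heartbeat.
        self._last_seen[(service, ip, port)] = self.clock()

    def hosts_response(self, service):
        """Build the body returned to the proxy for one service."""
        now = self.clock()
        live = [
            {"ip_address": ip, "port": port, "tags": {}}
            for (svc, ip, port), seen in sorted(self._last_seen.items())
            if svc == service and now - seen <= self.ttl
        ]
        return {"hosts": live}
```

Hosts that stop sending heartbeats drop out of the response after the TTL, so the proxy stops routing to them without any change to the application.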

Slide 29

Slide 29 text

The hard part of ECS • Copy current ECS service • Wait then switch • Delete old one

Slide 30

Slide 30 text

Generally available

Slide 31

Slide 31 text

Operations

Slide 32

Slide 32 text

Dashboards • Prometheus • Grafana ‣ Per service ‣ Per service-to-service ‣ Envoy instances • Vizceral ‣ promviz, promviz-front

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Envoy on EC2 • Build and distribute as an in-house deb package • Manage as a systemd service • Use hot-restarter.py ‣ Generate starter script for each host role

Slide 39

Slide 39 text

wait-side-car • Sidecar Envoy containers need a few seconds to be up ‣ For background jobs • Wrapper command-line tool ‣ cookpad/wait-side-car https://github.com/cookpad/wait-side-car
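
The idea behind such a wrapper can be sketched as: poll the sidecar until it answers, then run the real command. The code below is an illustration only and not how cookpad/wait-side-car is implemented; the readiness URL, the timing defaults, and all function names are assumptions.

```python
import subprocess
import sys
import time
import urllib.error
import urllib.request


def is_ready(url, timeout=1.0):
    """Return True if the (hypothetical) readiness URL answers with 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, deadline_sec=10.0, interval_sec=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it succeeds or the deadline passes."""
    deadline = clock() + deadline_sec
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_sec)


def main(argv):
    # Assumed usage: wait_sidecar.py <readiness-url> <command> [args...]
    url, command = argv[1], argv[2:]
    if not wait_until_ready(lambda: is_ready(url)):
        print("sidecar did not become ready in time", file=sys.stderr)
        return 1
    # Run the wrapped job only once the sidecar can accept traffic.
    return subprocess.call(command)
```

A script entry point would call `sys.exit(main(sys.argv))`. This matters mostly for short-lived background jobs, which may otherwise issue their first request before the sidecar is listening.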

Slide 40

Slide 40 text

https://techlife.cookpad.com/entry/2018/04/02/140846

Slide 41

Slide 41 text

Key results

Slide 42

Slide 42 text

Resiliency • Eliminates temporary bursts of errors from backend services • Speed of reviewing settings and deployment • Fault isolation: no remarkable results yet

Slide 43

Slide 43 text

Observability • Decrease in time to detect root causes of service communication issues • Visualization of how resilience mechanisms are working • One of the sources for Service Level Indicators

Slide 44

Slide 44 text

Continuous Growth of platform • Improve application platform without application deployment • Increase velocity of platform development team

Slide 45

Slide 45 text

Next challenges

Slide 46

Slide 46 text

Next challenges • v2 xDS migration • More effective traffic control • Chaos engineering platform • Distributed tracing • Auth[z, n]

Slide 47

Slide 47 text

Wrap up

Slide 48

Slide 48 text

Wrap up • Issues around service communications • Introducing a service mesh instead of the library approach • Key results: resiliency, observability, platform improvement

Slide 49

Slide 49 text

Q&A • Twitter: @taiki45 • http://techlife.cookpad.com/ • EnvoyCon 2018 https://events.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2018/co-located-events/envoycon/