Slide 1

Slide 1 text

SRE NEXT 2020 Designing fault-tolerant microservices with SRE and circuit breaker centric architecture Takayuki Watanabe Cookpad Inc. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe

Slide 2

Slide 2 text

Who?
Name: Takayuki Watanabe
Affiliation: Cookpad Inc.
Job: Site Reliability Engineering Chapter Lead
SNS:
- Blog: blog.takanabe.tokyo
- GitHub: takanabe
- Twitter: @takanabe_w
Interests:
- Chaos Engineering
- Distributed Systems
- Resilience Engineering

Slide 3

Slide 3 text

Menu
• About Cookpad Global
• Search-v2 and ML APIs
• Gaps: ideal and reality
• Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

Slide 4

Slide 4 text

Out of scope
• Monolith vs SOA vs Microservices
• Software design and development in Cloud Na

Slide 5

Slide 5 text

About Cookpad Global

Slide 6

Slide 6 text

Cookpad Global by numbers
• 42,700,000 monthly users
• 3,160,000 recipes
• 74 countries
• 32 languages

Slide 7

Slide 7 text

Cookpad Global by numbers
• 1 monolith + 7 microservices in production
• 300+ spot instances for ECS clusters
• 400+ deployments per ECS task definition per day
• 20 deployments to production per day

Slide 8

Slide 8 text

Cookpad Global by numbers
• 23 backend developers (Ruby: 19, Python: 4)
• 5 Site Reliability Engineers

Slide 9

Slide 9 text

See more details on Speaker Deck ... [1][2]
[1] Cookpad TechConf 2018, Challenges for Global Service from a Perspective of SRE
[2] Cookpad TechConf 2019, Challenges for Global Service from a Perspective of SRE ~ 2nd season ~

Slide 10

Slide 10 text

Go back to 2019...

Slide 11

Slide 11 text

Make everyday cooking fun!

Slide 12

Slide 12 text

Search is essential [3]
[3] Go Global - #CookpadTechconf 2017

Slide 13

Slide 13 text

Can users reach the best recipes out of 3,160,000 recipes?

Slide 14

Slide 14 text

No...

Slide 15

Slide 15 text

Search-v2 and ML APIs

Slide 16

Slide 16 text

Search-v2 and ML APIs
• Search-v2: helps people find their favorite recipes to cook
  • e.g. personalized search, visual search, recommendations
• ML APIs: let other APIs provide features that integrate machine learning
  • e.g. image enhancement, image to recipe

Slide 17

Slide 17 text

Got it.

Slide 18

Slide 18 text

So, who develops them?

Slide 19

Slide 19 text


Slide 20

Slide 20 text


Slide 21

Slide 21 text

Machine learning researcher ≠ SWE in machine learning

Slide 22

Slide 22 text


Slide 23

Slide 23 text

4 search/machine learning integration engineers joined

Slide 24

Slide 24 text


Slide 25

Slide 25 text

Everything goes smoothly!!

Slide 26

Slide 26 text

Everything goes smoothly!!

Slide 27

Slide 27 text

Gaps: ideal and reality

Slide 28

Slide 28 text

Gaps: Organization & technology stack

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Microservice architecture = Each team can use any technology we want

Slide 31

Slide 31 text

Microservice architecture = Each team can use any technology we want

Slide 32

Slide 32 text

Have we finished decoupling the monolith into microservices?

Slide 33

Slide 33 text

Do we have enough developers?

Slide 34

Slide 34 text

Can we transfer internal resources and knowledge to other teams?

Slide 35

Slide 35 text

We need more effort to gain the benefits of a microservice architecture

Slide 36

Slide 36 text

We restrict the technology stack we use

Slide 37

Slide 37 text


Slide 38

Slide 38 text


Slide 39

Slide 39 text

Is it possible to develop search-v2/ML APIs with those tech stacks?

Slide 40

Slide 40 text

Break barriers. Otherwise, no future

Slide 41

Slide 41 text

As-Is: Developers use a restricted technology stack
To-Be: Search/ML team can use mainstream technology stack for their fields

Slide 42

Slide 42 text

Gaps: Expectations for service level

Slide 43

Slide 43 text


Slide 44

Slide 44 text


Slide 45

Slide 45 text

This service is experimental
This service is beta
This service is prototype
This service is [ANY EXPRESSIONS]

Slide 46

Slide 46 text

Low service level APIs potentially cause cascading outages

Slide 47

Slide 47 text


Slide 48

Slide 48 text

As-Is: Production is down due to outages of new microservices
To-Be: No production outages due to low service level microservices

Slide 49

Slide 49 text

Gaps: Team capacity

Slide 50

Slide 50 text


Slide 51

Slide 51 text

Does the team have enough capacity for on-call?
"Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month." [4]
"For production on-call responsibilities, I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people" [5]
[4] Google - Site Reliability Engineering, Chapter 11 - Being On-Call
[5] Larson, Will. An Elegant Puzzle: Systems of Engineering Management, 2.1 Sizing teams (p.33)

Slide 52

Slide 52 text

As-Is: People have to be responsible for on-call rotations for new microservices
To-Be: New search/ML team must be free from on-call pressures for their new microservices

Slide 53

Slide 53 text

Gaps: Knowledge for product development

Slide 54

Slide 54 text


Slide 55

Slide 55 text

As-Is: Many teams need tough negotiations to release ML related features
To-Be: ML team can release experimental features with light process in production

Slide 56

Slide 56 text

Goals for the SRE team
As-Is | To-Be | Approach
Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields |
Production outages due to new microservices | No production outages due to low service level microservices |
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices |
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production |

Slide 57

Slide 57 text

Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

Slide 58

Slide 58 text


Slide 59

Slide 59 text

Design Docs
• Reach consensus on scopes and expectations [6]
• At Cookpad, only the SRE team knows the entire system design [7]
[6] Google, Site Reliability Engineering, Chapter 31 - Communica
[7] Google, The Site Reliability Workbook, Chapter 7 - Simplicity

Slide 60

Slide 60 text


Slide 61

Slide 61 text


Slide 62

Slide 62 text

Goals for the SRE team
As-Is | To-Be | Approach
Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

Slide 63

Slide 63 text

Goals for the SRE team
As-Is | To-Be | Approach
Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document + ?
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

Slide 64

Slide 64 text

Approach

Slide 65

Slide 65 text

Delegation and resource isolation

Slide 66

Slide 66 text

Resource isolation = AWS resource isolation

Slide 67

Slide 67 text


Slide 68

Slide 68 text

Implementation pattern
• IAM (delegation level: low)
• IAM Permissions Boundary (delegation level: medium)
• Dedicated AWS account (delegation level: high)

Slide 69

Slide 69 text

Dedicated AWS account
• Use AWS Organizations to issue a new AWS account
• SRE designs the network
• Build VPC peering between new and old VPCs

Slide 70

Slide 70 text


Slide 71

Slide 71 text

Search/ML team can use mainstream technology for their fields

Slide 72

Slide 72 text

Transparent security and audit support
• Enforce managed audit and security services on AWS
  • VPC Flow Logs
  • CloudTrail
  • GuardDuty
  • AWS Config

Slide 73

Slide 73 text

Goals for the SRE team
As-Is | To-Be | Approach
Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

Slide 74

Slide 74 text

Don't accept exceptions
• We only have 3 SREs (in 2019)
• Follow the boundary we define in the design document
• Don't share servers managed by the SRE team
• Use SaaS to accelerate minimum product development cycles [8]
  • e.g. CI
  • e.g. Observability
[8] Practical Monitoring: Effective Strategies for the Real World, Chapter 2.3 Pattern #3: Buy, Not Build

Slide 75

Slide 75 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 76

Slide 76 text


Slide 77

Slide 77 text

Disconnecting unstable production microservices makes sense

Slide 78

Slide 78 text

Approach

Slide 79

Slide 79 text

Circuit breaker centric architecture

Slide 80

Slide 80 text

Why Circuit Breaker?
• Fail fast strategy to prevent cascading failures
• Limits external service and network impacts
• Don’t waste capacity calling a broken service
• External service is slow
• External service is down
• Network is unstable

Slide 81

Slide 81 text

State transition diagram

Slide 82

Slide 82 text

Case study

Slide 83

Slide 83 text


Slide 84

Slide 84 text

• Closed
  • Traffic flows normally
  • Health is assessed every 100ms based on a 10s rolling average
• Open / Tripped
  • Fail fast - return 503 error
  • Stays in this state for 10s
• Recovering / Half Open
  • Ramp up traffic over 10s
  • Check health every 100ms -> if it fails, go back to Open state
  • Return to Closed if health is OK after 10s
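The three states above can be sketched as a small state machine. This is a minimal illustration rather than the proxy's actual implementation: the class name, the `tick` and `allowed_fraction` methods, and the linear ramp-up are assumptions made for the example, and the health signal (the 10s rolling average) is supplied by the caller instead of being computed here.

```python
import enum


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Toy state machine; `healthy` is supplied by the caller on each tick."""

    def __init__(self, open_seconds=10.0, recovery_seconds=10.0):
        self.open_seconds = open_seconds          # time spent in Open
        self.recovery_seconds = recovery_seconds  # ramp-up time in Half Open
        self.state = State.CLOSED
        self.entered_at = 0.0

    def _move(self, state, now):
        self.state = state
        self.entered_at = now

    def tick(self, now, healthy):
        """Advance the state machine; intended to be called every 100 ms."""
        elapsed = now - self.entered_at
        if self.state is State.CLOSED and not healthy:
            self._move(State.OPEN, now)
        elif self.state is State.OPEN and elapsed >= self.open_seconds:
            self._move(State.HALF_OPEN, now)
        elif self.state is State.HALF_OPEN:
            if not healthy:
                self._move(State.OPEN, now)
            elif elapsed >= self.recovery_seconds:
                self._move(State.CLOSED, now)
        return self.state

    def allowed_fraction(self, now):
        """Share of traffic passed upstream; the rest would get a 503."""
        if self.state is State.CLOSED:
            return 1.0
        if self.state is State.OPEN:
            return 0.0
        # Half Open: assumed linear ramp from 0 to 1 over the recovery period.
        return min(1.0, (now - self.entered_at) / self.recovery_seconds)
```

A caller would invoke `tick(now, healthy)` on a 100ms timer and use `allowed_fraction(now)` to decide how many requests to forward while recovering.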

Slide 85

Slide 85 text

Circuit Breaker - Implications
• We can introduce experimental and new services with less risk to other parts of the application
• Slow responses ~= Outage!
• Fallback strategies become more important
• SLOs gain value as a communication tool about service levels

Slide 86

Slide 86 text

Implementation pattern
• Application library (e.g. cookpad/expeditor, Netflix/Hystrix)
• Proxy (e.g. Envoy Proxy, Traefik)
• Service Mesh (e.g. Istio, Maesh)

Slide 87

Slide 87 text

Circuit breaker proxy sidecar container
• Use an L7 reverse proxy with circuit breaking middleware
• Each microservice has its own independently configured circuit breaker
• Run as a sidecar container

Slide 88

Slide 88 text

Traefik as circuit breaker proxy
• NetworkErrorRatio
  • Covers networking errors connecting to the service
  • Shedding load can help some errors to recover!
• ResponseCodeRatio
  • Don’t bother calling a broken service
• LatencyAtQuantileMS
  • Isolate slow services.

Slide 89

Slide 89 text

Traefik configuration example
{
  service1: {
    backend: 'http://service1_endpoint',
    circuit_breaker: "LatencyAtQuantileMS(50.0) > 1000 || ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10",
  },
  service2: {
    backend: 'http://service2_endpoint',
    circuit_breaker: "LatencyAtQuantileMS(50.0) > 3000 || ResponseCodeRatio(500, 600, 0, 600) > 0.10 || NetworkErrorRatio() > 0.10",
  },
}
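To make the service1 expression concrete, here is a rough sketch of what the three conditions measure, evaluated over one window of observations. It is not Traefik code: `should_trip` and its parameters are hypothetical, the rolling-window bookkeeping is left out, and reading `ResponseCodeRatio(500, 600, 0, 600)` as "responses with codes in [500, 600) divided by responses with codes in [0, 600)" is an assumption of this example.

```python
import statistics


def should_trip(latencies_ms, status_codes, network_errors, attempts,
                max_median_ms=1000, max_5xx_ratio=0.30, max_net_ratio=0.10):
    """Return True if any of the three conditions exceeds its threshold."""
    # LatencyAtQuantileMS(50.0): median latency in milliseconds.
    median_ms = statistics.median(latencies_ms) if latencies_ms else 0.0

    # ResponseCodeRatio(500, 600, 0, 600): 5xx responses over all responses.
    in_range = [code for code in status_codes if 0 <= code < 600]
    server_errors = [code for code in in_range if 500 <= code < 600]
    ratio_5xx = len(server_errors) / len(in_range) if in_range else 0.0

    # NetworkErrorRatio(): connection-level failures over all attempts.
    net_ratio = network_errors / attempts if attempts else 0.0

    return (median_ms > max_median_ms
            or ratio_5xx > max_5xx_ratio
            or net_ratio > max_net_ratio)
```

With the service2 thresholds from the slide, the same function would be called with `max_median_ms=3000` and `max_5xx_ratio=0.10`.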

Slide 90

Slide 90 text


Slide 91

Slide 91 text

How do we decide the thresholds?

Slide 92

Slide 92 text

SLO

Slide 93

Slide 93 text


Slide 94

Slide 94 text

Can developers define SLO?

Slide 95

Slide 95 text

Availability class

Slide 96

Slide 96 text

Availability class
• We customize the production readiness check (a.k.a. production readiness review [9]) as availability classes
[9] Google - Site Reliability Engineering, Chapter 32 - The Evolving SRE Engagement Model

Slide 97

Slide 97 text

Availability class presets
• Baseline
• Medium
• High
• No SLO

Slide 98

Slide 98 text

Baseline availability class
Availability Target: > 95%
Period | Downtime Budget
Daily | 1h 12m
Weekly | 8h 24m
Monthly | 36h 31m

Slide 99

Slide 99 text

Medium availability class
Availability Target: > 99%
Period | Downtime Budget
Daily | 14m 24s
Weekly | 1h 41m
Monthly | 7h 18m

Slide 100

Slide 100 text

High availability class
Availability Target: > 99.9%
Period | Downtime Budget
Daily | 1m 26s
Weekly | 10m 4s
Monthly | 43m 49s
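The downtime budgets in the three availability classes follow from one multiplication: the unavailable fraction times the length of the period. A minimal sketch, assuming a month is taken as 365.2425 / 12 days (which appears to match the monthly figures above); the function name is made up for this example:

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * 24
HOURS_PER_MONTH = 365.2425 / 12 * 24  # assumed average month length


def downtime_budget_seconds(target_percent, period_hours):
    """Allowed downtime in seconds for an availability target over a period."""
    return period_hours * 3600 * (100 - target_percent) / 100


for name, target in [("Baseline", 95), ("Medium", 99), ("High", 99.9)]:
    daily = downtime_budget_seconds(target, HOURS_PER_DAY)
    weekly = downtime_budget_seconds(target, HOURS_PER_WEEK)
    monthly = downtime_budget_seconds(target, HOURS_PER_MONTH)
    # Print whole minutes to keep the output easy to compare with the tables.
    print(f"{name}: daily {daily / 60:.1f}m, "
          f"weekly {weekly / 60:.1f}m, monthly {monthly / 60:.1f}m")
```

The slides show these values rounded to minutes or seconds, so small differences from the printed numbers are expected.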

Slide 101

Slide 101 text


Slide 102

Slide 102 text


Slide 103

Slide 103 text

How do we know the service level?

Slide 104

Slide 104 text

Alerting on SLO

Slide 105

Slide 105 text

Implementing alerts on SLO
There are several strategies to implement alerts on SLO [10]
• Target Error Rate ≥ SLO Threshold
• Increased Alert Window
• Incremen

Slide 106

Slide 106 text

Implementing alerts on SLO
There are several strategies to implement alerts on SLO [10]
• Target Error Rate ≥ SLO Threshold
• Increased Alert Window
• Incremen

Slide 107

Slide 107 text

Burn rate
Burn rate is how fast a service consumes the error budget defined by its SLO

Slide 108

Slide 108 text

Burn rates and time to complete budget exhaustion [10]
Burn rate | Error rate for 99.9% SLO | Time to exhaustion
1 | 0.1% | 30 days
2 | 0.2% | 15 days
10 | 1% | 3 days
1000 | 100% | 43 minutes
[10] Google - The Site Reliability Workbook, Chapter 5: Aler
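The rows of this table can be reproduced with two short formulas. This is a sketch assuming a 30-day SLO period, as the first row implies; the function names are illustrative only.

```python
def error_rate_at(burn_rate, slo_percent=99.9):
    """Error ratio that corresponds to a burn rate for a given SLO."""
    return burn_rate * (100 - slo_percent) / 100


def days_to_exhaustion(burn_rate, period_days=30):
    """Days until the whole error budget is spent at a constant burn rate."""
    return period_days / burn_rate


for rate in (1, 2, 10, 1000):
    print(rate, f"{error_rate_at(rate):.1%}", f"{days_to_exhaustion(rate):.3f} days")
```

At a burn rate of 1000 the result is 0.03 days, which is the roughly 43 minutes shown in the last row.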

Slide 109

Slide 109 text

Burn rates and time to complete budget exhaustion [10]
[10] Google - The Site Reliability Workbook, Chapter 5: Aler

Slide 110

Slide 110 text

Multiwindow, Multi-Burn-Rate Alerts [10]
• This approach provides good-precision alerts and reduces the number of false positives
• Make the short window 1/12 the duration of the long window as the starting point
Severity | Notification | Long window | Short window | Burn rate | Error budget consumed
Critical | Pager | 1 hour | 5 minutes | 14.4 | 2%
Critical | Pager | 6 hours | 30 minutes | 6 | 5%
Warning | Chat, ticket | 3 days | 6 hours | 1 | 10%
[10] Google - The Site Reliability Workbook, Chapter 5: Aler
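The "Error budget consumed" column follows from the burn rate and the long window. A minimal sketch, assuming a 30-day SLO period; both helper names are hypothetical:

```python
def budget_consumed(burn_rate, long_window_hours, period_hours=30 * 24):
    """Fraction of the error budget spent if burn_rate holds for the long window."""
    return burn_rate * long_window_hours / period_hours


def short_window_hours(long_window_hours):
    """Starting point from the slide: short window is 1/12 of the long window."""
    return long_window_hours / 12


for burn_rate, long_hours in [(14.4, 1), (6, 6), (1, 72)]:
    print(f"burn rate {burn_rate}: "
          f"{budget_consumed(burn_rate, long_hours):.0%} of budget, "
          f"short window {short_window_hours(long_hours) * 60:.0f} minutes")
```

For example, a burn rate of 14.4 sustained for one hour spends 14.4 / 720 of a 30-day budget, which is the 2% in the first row.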

Slide 111

Slide 111 text

Multiwindow, Multi-Burn-Rate Alerts [10]
[10] Google - The Site Reliability Workbook, Chapter 5: Aler

Slide 112

Slide 112 text

Multiwindow, Multi-Burn-Rate Alerts [10]
[10] Google - The Site Reliability Workbook, Chapter 5: Aler

Slide 113

Slide 113 text


Slide 114

Slide 114 text


Slide 115

Slide 115 text


Slide 116

Slide 116 text

Implementing Prometheus configs in Jsonnet
• Jsonnet [11] is a data templating language
• Simple extension of JSON
• Eliminate duplication with object-orientation
[11] google/jsonnet: Jsonnet - The data templating language

Slide 117

Slide 117 text

Prometheus config structure in Jsonnet
$ tree prometheus-config
prometheus-config
├── alertmanager.jsonnet
├── alertmanager_templates.jsonnet
├── lib
│   ├── alert.libsonnet
│   ├── alertmanager.libsonnet
│   [...snip...]
│   ├── traefik.libsonnet
│   └── utils.libsonnet
├── platform.libsonnet
├── prometheus_rules.jsonnet
├── runbooks
│   ├── alertmanager-down.md
│   ├── blackbox-exporter-down.md
│   [...snip...]
│   └── ssh-probe-failed.md
├── services
│   ├── service1.libsonnet
│   └── service2.libsonnet
└── services.libsonnet

Slide 118

Slide 118 text

Alerting rule library for Traefik
$ cat lib/traefik.libsonnet
{
  [...snip...]
  traefik_backend_high_error_budget_burn_rate_alert: self.alert {
    name: 'TraefikBackendHighErrorBudgetBurnRate',
    summary: '[{{ $labels.backend }} in {{ $labels.environment }}] Traefik backend error budget burn rate is high',
    description: '[{{ $labels.backend }} in {{ $labels.environment }}] Immediate intervention is required to defend the Uptime SLO',
    expr: |||
      (
        environment_backend:traefik_backend_errors_per_request:ratio_rate1h{%(matchers)s} > (14.4*0.001)
        and
        environment_backend:traefik_backend_errors_per_request:ratio_rate5m{%(matchers)s} > (14.4*0.001)
      )
      or
      (
        environment_backend:traefik_backend_errors_per_request:ratio_rate6h{%(matchers)s} > (6*0.001)
        and
        environment_backend:traefik_backend_errors_per_request:ratio_rate30m{%(matchers)s} > (6*0.001)
      )
    ||| % self,
  },
  [...snip...]
}
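The boolean logic of this alert expression can be restated outside PromQL. The sketch below mirrors the two window pairs, with 0.001 as the error budget of a 99.9% SLO; `should_page` and the dictionary keys are hypothetical names, and the error ratios are assumed to be computed elsewhere (in the original, by the recording rules).

```python
def should_page(ratios, error_budget=0.001):
    """ratios maps a window name ('5m', '30m', '1h', '6h') to an error ratio."""
    # Fast burn: both the 1h and the 5m window exceed 14.4x the budget.
    fast_burn = (ratios["1h"] > 14.4 * error_budget
                 and ratios["5m"] > 14.4 * error_budget)
    # Slow burn: both the 6h and the 30m window exceed 6x the budget.
    slow_burn = (ratios["6h"] > 6 * error_budget
                 and ratios["30m"] > 6 * error_budget)
    return fast_burn or slow_burn
```

Requiring the short window to exceed the threshold as well means the alert stops firing soon after the errors stop, instead of staying active until the long window drains.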

Slide 119

Slide 119 text

Alerting config for service1
$ cat services/service1.libsonnet
local resque = import '../lib/resque.libsonnet';
local service = import '../lib/service.libsonnet';
local traefik = import '../lib/traefik.libsonnet';

service {
  name: 'service1',
  slack_channel: 'service1-alerts',
  dashboard: 'https://grafana.example.com./d/service1',
  components+: [
    [...snip...]
    self.component('traefik') {
      alerts+: [
        self.traefik_backend_high_error_budget_burn_rate_alert {
          matchers: 'backend="service1", environment="production"',
        },
        self.traefik_backend_high_error_budget_burn_rate_warning_alert {
          matchers: 'backend="service1", environment="production"',
        },
      ],
    } + traefik,
    [...snip...]
  ],
}

Slide 120

Slide 120 text

Caution!
• Jsonnet is a super powerful language for eliminating redundancy
• Too DRYed-configura

Slide 121

Slide 121 text


Slide 122

Slide 122 text


Slide 123

Slide 123 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 124

Slide 124 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 125

Slide 125 text


Slide 126

Slide 126 text

Strategy to free the new team from on-call pressure

Slide 127

Slide 127 text

Fallback to search-v1 when circuit breaker is open
• Proxy partial requests to search-v2 in feature toggle
• Strict circuit breaking threshold (No SLO or extremely low SLO) and fail fast when upstream is unstable
• Rescue all errors in feature toggle
• Fall back all requests to search-v1 when circuit breaker returns 503s
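The fallback described above can be sketched in a few lines. This is an illustration of the pattern, not Cookpad's code: the function signature, `rollout_fraction`, and the injected `rng` are assumptions made so the example is self-contained and testable.

```python
import random


def search(query, search_v1, search_v2, rollout_fraction=0.1, rng=random.random):
    """Send a share of requests to search-v2 and fall back to search-v1 on any error."""
    if rng() < rollout_fraction:  # feature toggle: only part of the traffic
        try:
            return search_v2(query)
        except Exception:
            # Rescue every error, including a 503 from an open circuit breaker,
            # and fall through to the stable implementation.
            pass
    return search_v1(query)
```

Because every failure of search-v2 ends in a search-v1 response, an outage of the new service degrades result quality rather than availability, which is what removes the need to page the new team.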

Slide 128

Slide 128 text


Slide 129

Slide 129 text


Slide 130

Slide 130 text


Slide 131

Slide 131 text


Slide 132

Slide 132 text

On-call is not necessary in the new team

Slide 133

Slide 133 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 134

Slide 134 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 135

Slide 135 text

Implementation pattern
• API Gateway (BFF) for mobile apps with JWT
• Feature toggle + path-based routing

Slide 136

Slide 136 text

BFF pa&ern for mobile clients in Cookpad 12 12 Cookpad Developers' Blog, ϞμϯBFFΛ׆༻ͨ͠طଘAPIαʔόʔͷ࠶ߏங SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 136

Slide 137

Slide 137 text


Slide 138

Slide 138 text

Prod endpoint + feature toggle + path-based routing
• Specify shared single ML API endpoint in feature toggle
• Strict circuit breaking threshold (No SLO) and fail fast when upstream is unstable
• Rescue all errors in feature toggle and dismiss
• Change destination for each ML API based on request path

Slide 139

Slide 139 text


Slide 140

Slide 140 text


Slide 141

Slide 141 text


Slide 142

Slide 142 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 143

Slide 143 text

Goals for the SRE team As-Is To-Be Approach Developers use restricted technology stack Search/ML team can use mainstream technology stack for their fields Design document Delega

Slide 144

Slide 144 text


Slide 145

Slide 145 text

Recap

Slide 146

Slide 146 text

SRE expertise and circuit breaker
• Protect microservices from unreliable microservices
• Enforce contracts (alignment) among teams
• Provide an on-call free environment for the new team
• Enable developers to release experimental features
• Reduce unproductive communication among teams

Slide 147

Slide 147 text

Bonus talk

Slide 148

Slide 148 text

What is the best on-call rotation?
• It really depends on your team members
  • Some people love weekly rotation
  • Some people love daily rotation
  • Some people love on-call on weekends
• Don't create an organization-wide rotation rule [13]
[13] A well-designed policy on on-call compensation is necessary to achieve this

Slide 149

Slide 149 text

On-call rotation strategy in Cookpad
• Don't page on events that don't damage our SLO
• Use advantages of time-zone differences and distributed team [14]
• SREs and developers collaborate closely to fix problems
[14] Strategy for two-tier on-call rotation, https://blog.takanabe.tokyo

Slide 150

Slide 150 text

On-call rotation in Ruby backend team

Slide 151

Slide 151 text

On-call rotation in SRE team
• Hybrid strategy to use advantages of time-zone differences
  • JP (UTC+9) & UK (UTC+0) business hour shift
  • Daily off-hours rotation

Slide 152

Slide 152 text

+ Incident evacuation drill (≠Chaos engineering)

Slide 153

Slide 153 text


Slide 154

Slide 154 text

How can we introduce SRE in an organization?
If you try to introduce the SRE methodology and culture with a bottom-up approach,
• Start from a small thing
• Find a buddy on the product development teams who is happy to support your ideas
• Provide incentives to your product developers
  • SREs are responsible for primary on-call if your services achieve your SLO standard (e.g. 99.99% availability) for a month
  • Find a win-win strategy for developers and SREs
• Don't throw an SRE sales pitch
  • Don't play the "SRE is one of the Google best practices" card
  • We should seriously provide benefits to the organization with SRE methodologies (Why do we need SLO? What benefits do we have?)

Slide 155

Slide 155 text

Achievements
• Improvement of production stability
• Apply SRE techniques to real services
• Release of machine learning integrated search in production [15]
• Release of machine learning oriented infrastructure
[15] Vector scoring for term embeddings in Elasticsearch - Speaker Deck

Slide 156

Slide 156 text

What's next?
• Promote SRE culture with battle-tested methodologies
• Providing JWT auth endpoint for ML and other microservices
  • Machine learning researchers want to provide services that will be consumed by beta builds of mobile applications
  • Monolith doesn't need frequent code changes for ML experiences
  • Monolith doesn't have to proxy anything (this sounds worry

Slide 157

Slide 157 text

Thank you