SRE NEXT 2020 [C6] Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

The deck for the talk at SRE NEXT 2020 (https://sre-next.dev/schedule#c6)

Transcript

  1. SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
    Takayuki Watanabe, Cookpad Inc.
    SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe
  2. Who?
    Name: Takayuki Watanabe
    Affiliation: Cookpad Inc.
    Job: Site Reliability Engineering Chapter Lead
    SNS:
    • Blog: blog.takanabe.tokyo
    • GitHub: takanabe
    • Twitter: @takanabe_w
    Interests:
    • Chaos Engineering
    • Distributed Systems
    • Resilience Engineering
  3. Menu
    • About Cookpad Global
    • Search-v2 and ML APIs
    • Gaps: ideal and reality
    • Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
  4. Out of scope
    • Monolith vs SOA vs Microservices
    • Software design and development in Cloud Native Era
    • Container orchestrators: Why ECS? Why EKS (k8s)?
    • Explanation of fundamental SRE words (e.g. SLO, SLI, Error budget)
  5. About Cookpad Global
  6. Cookpad Global by numbers
    • 42,700,000 monthly users
    • 3,160,000 recipes
    • 74 countries
    • 32 languages
  7. Cookpad Global by numbers
    • 1 monolith + 7 microservices in production
    • 300+ spot instances for ECS clusters
    • 400+ deployments per ECS task definition per day
    • 20 deployments to production per day
  8. Cookpad Global by numbers
    • 23 backend developers (Ruby: 19, Python: 4)
    • 5 Site Reliability Engineers
  9. See more details on Speaker Deck ... [1][2]
    [1] Cookpad TechConf 2018, Challenges for Global Service from a Perspective of SRE
    [2] Cookpad TechConf 2019, Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
  10. Go back to 2019...
  11. Make everyday cooking fun!
  12. Search is essential [3]
    [3] Go Global - #CookpadTechconf 2017
  13. Can users reach the best recipes out of 3,160,000 recipes?
  14. No...
  15. Search-v2 and ML APIs
  16. Search-v2 and ML APIs
    • Search-v2: people can meet their favorite recipes for cooking
      • (e.g.) Personalized search, visual search, recommendations
    • ML APIs: Other APIs can provide machine learning integrated features
      • (e.g.) Image enhancement, image to recipe
  17. Got it.
  18. So, who develops them?
  19. 19

  20. 20

  21. Machine learning researcher ≠ SWE in machine learning
  22. 22

  23. 4 search/machine learning integration engineers joined

  24. 24

  25. Everything goes smoothly!!
  26. Everything goes smoothly!!
  27. Gaps: ideal and reality
  28. Gaps: Organization & technology stack
  29. 29

  30. Microservice architecture = Each team can use any technology they want
  31. Microservice architecture = Each team can use any technology they want
  32. Have we finished decoupling the monolith into microservices?
  33. Do we have enough developers?
  34. Can we transfer internal resources and knowledge to other teams?
  35. We need more effort to gain the benefits of a microservice architecture
  36. We restrict the technology stack we use
  37. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 37

  38. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 38

  39. Is it possible to develop search-v2/ML APIs with those tech stacks?
  40. Break barriers. Otherwise, no future
  41. As-Is: Developers use restricted technology stack
    To-Be: Search/ML team can use mainstream technology stack for their fields
  42. Gaps: Expectations of service level
  43. 43

  44. 44

  45. This service is experimental
    This service is beta
    This service is prototype
    This service is [ANY EXPRESSIONS]
  46. Low service level APIs potentially cause cascading outages
  47. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 47

  48. As-Is: Production is down due to outages of new microservices
    To-Be: No production outages due to low service level microservices
  49. Gaps: Team capacity
  50. 50

  51. Does the team have enough capacity for on-call?
    "Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month." [4]
    "For production on-call responsibilities, I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people" [5]
    [4] Google - Site Reliability Engineering, Chapter 11 - Being On-Call
    [5] Larson, Will. An Elegant Puzzle: Systems of Engineering Management, 2.1 Sizing teams (p.33)
  52. As-Is: People have to be responsible for on-call rotations for new microservices
    To-Be: New search/ML team must be free from on-call pressures for their new microservices
  53. Gaps: Knowledge for product development
  54. 54

  55. As-Is: Many teams need tough negotiations to release ML related features
    To-Be: ML team can release experimental features with light process in production
  56. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | -
    Production outages due to new microservices | No production outages due to low service level microservices | -
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | -
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | -
  57. Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
  58. 58

  59. Design Docs
    • Reach consensus on scopes and expectations [6]
    • At Cookpad, only the SRE team knows the entire system design [7]
    [6] Google, Site Reliability Engineering, Chapter 31 - Communication and Collaboration in SRE
    [7] Google, The Site Reliability Workbook, Chapter 7 - Simplicity
  60. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 60

  61. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 61

  62. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  63. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document + ?
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  64. Approach
  65. Delegation and resource isolation
  66. Resource isolation = AWS resource isolation
  67. 67

  68. Implementation pattern
    • IAM (delegation level: low)
    • IAM Permissions Boundary (delegation level: medium)
    • Dedicated AWS account (delegation level: high)
  69. Dedicated AWS account
    • Use AWS Organizations to issue new AWS account
    • Design network by SRE
    • Build VPC peering between new and old VPCs
  70. 70

  71. Search/ML team can use mainstream technology for their fields
  72. Transparent security and audit support
    • Enforce managed audit and security service on AWS
      • VPCFlowLogs
      • CloudTrail
      • GuardDuty
      • AWS Config
  73. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  74. Don't accept exceptions
    • We only have 3 SREs (in 2019)
    • Follow the boundary we define in the design document
    • Don't share servers managed by SRE team
    • Use SaaS to accelerate minimum product development cycles [8]
      • e.g. CI
      • e.g. Observability
    [8] Practical Monitoring: Effective Strategies for the Real World, Chapter 2.3 Pattern #3: Buy, Not Build
  75. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document + ?
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  76. 76

  77. Disconnecting unstable production microservices makes sense
  78. Approach
  79. Circuit breaker centric architecture
  80. Why Circuit Breaker?
    • Fail fast strategy to prevent cascading failures
    • Limits external service and network impacts
    • Don’t waste capacity calling a broken service
    • External service is slow
    • External service is down
    • Network is unstable
  81. State transition diagram
  82. Case study
  83. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 83

  84.
    • Closed
      • Traffic flows normally
      • Health is assessed every 100ms based on a 10s rolling average
    • Open / Tripped
      • Fail fast - return 503 error
      • Stays in this state for 10s
    • Recovering / Half Open
      • Ramp up traffic over 10s
      • Check health every 100ms -> if fail go back to Open state
      • Return to Closed if health is OK after 10s
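The three states above can be sketched as a small state machine. This is a minimal illustration, not the proxy's actual implementation: the class name, the `max_error_ratio` and `min_samples` parameters, and the linear ramp are assumptions, and health is re-evaluated on every recorded response rather than on a 100ms timer.

```python
import random
import time
from collections import deque


class CircuitBreaker:
    """Illustrative Closed -> Open -> Half Open -> Closed state machine."""

    def __init__(self, max_error_ratio=0.5, min_samples=5, window_s=10.0,
                 open_s=10.0, ramp_s=10.0, clock=time.monotonic, rng=random.random):
        self.max_error_ratio = max_error_ratio  # hypothetical health threshold
        self.min_samples = min_samples          # ignore windows with too little data
        self.window_s = window_s                # rolling window used to assess health
        self.open_s = open_s                    # time spent failing fast
        self.ramp_s = ramp_s                    # time spent ramping traffic back up
        self.clock = clock
        self.rng = rng
        self.state = "closed"
        self.changed_at = clock()
        self.samples = deque()                  # (timestamp, succeeded) pairs

    def _set_state(self, state, now):
        self.state = state
        self.changed_at = now
        self.samples.clear()

    def _healthy(self, now):
        # Drop samples that fell out of the rolling window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()
        if len(self.samples) < self.min_samples:
            return True
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples) <= self.max_error_ratio

    def allow(self):
        """Return True to call upstream, False to fail fast (for example with a 503)."""
        now = self.clock()
        if self.state == "open":
            if now - self.changed_at < self.open_s:
                return False
            self._set_state("half_open", now)
        if self.state == "half_open":
            elapsed = now - self.changed_at
            if elapsed >= self.ramp_s:
                self._set_state("closed", now)
                return True
            # Admit a linearly growing share of requests during the ramp.
            return self.rng() < elapsed / self.ramp_s
        return True

    def record(self, succeeded):
        """Record the outcome of an upstream call and trip the breaker if unhealthy."""
        now = self.clock()
        self.samples.append((now, succeeded))
        if self.state != "open" and not self._healthy(now):
            self._set_state("open", now)
```

A caller would check `allow()` before each upstream request and report the result with `record()`. The injectable `clock` and `rng` exist only to make the sketch deterministic in tests.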
  85. Circuit Breaker - Implications
    • We can introduce experimental and new services with less risk to other parts of the application
    • Slow responses ~= Outage!
    • Fallback strategies become more important
    • Adds value to using SLOs as a communication tool about service levels
  86. Implementation pattern
    • Application library (e.g. cookpad/expeditor, Netflix/Hystrix)
    • Proxy (e.g. Envoy Proxy, Traefik)
    • Service Mesh (e.g. Istio, Maesh)
  87. Circuit breaker proxy side-car container
    • Use an L7 reverse proxy with circuit breaking middleware
    • Each microservice has its own independently configured circuit breaker
    • Run as a sidecar container
  88. Traefik as circuit breaker proxy
    • NetworkErrorRatio
      • Covers networking errors connecting to the service
      • Shedding load can help some errors to recover!
    • ResponseCodeRatio
      • Don’t bother calling a broken service
    • LatencyAtQuantileMS
      • Isolate slow services.
  89. Traefik configuration example
        {
          service1: {
            backend: 'http://service1_endpoint',
            circuit_breaker: "LatencyAtQuantileMS(50.0) > 1000 || ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10",
          },
          service2: {
            backend: 'http://service2_endpoint',
            circuit_breaker: "LatencyAtQuantileMS(50.0) > 3000 || ResponseCodeRatio(500, 600, 0, 600) > 0.10 || NetworkErrorRatio() > 0.10",
          },
        }
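To make the trip condition concrete, the sketch below evaluates the `service1` expression over a list of recent observations. It is not Traefik's implementation: the `Observation` type, the nearest-rank quantile, and the function names are assumptions made for illustration. `ResponseCodeRatio(500, 600, 0, 600)` is read as the share of responses with a status in [500, 600) among those in [0, 600).

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    latency_ms: float
    status: Optional[int]  # None models a network error with no HTTP response


def latency_at_quantile_ms(observations: List[Observation], quantile: float) -> float:
    """Nearest-rank latency quantile, e.g. quantile=50.0 for the median."""
    latencies = sorted(o.latency_ms for o in observations)
    if not latencies:
        return 0.0
    rank = max(1, math.ceil(quantile / 100.0 * len(latencies)))
    return latencies[rank - 1]


def response_code_ratio(observations, lo1, hi1, lo2, hi2) -> float:
    """Share of responses with status in [lo1, hi1) among those in [lo2, hi2)."""
    statuses = [o.status for o in observations if o.status is not None]
    total = sum(1 for s in statuses if lo2 <= s < hi2)
    if total == 0:
        return 0.0
    return sum(1 for s in statuses if lo1 <= s < hi1) / total


def network_error_ratio(observations) -> float:
    """Share of calls that failed before any HTTP response was received."""
    if not observations:
        return 0.0
    return sum(1 for o in observations if o.status is None) / len(observations)


def service1_should_trip(observations) -> bool:
    """Mirror of the service1 expression: any one clause is enough to trip."""
    return (latency_at_quantile_ms(observations, 50.0) > 1000
            or response_code_ratio(observations, 500, 600, 0, 600) > 0.30
            or network_error_ratio(observations) > 0.10)
```

Because the clauses are joined with `||`, a service that is merely slow trips the breaker just like one that returns errors, which matches the "Slow responses ~= Outage!" point earlier in the deck.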
  90. 90

  91. How do we decide the thresholds?

  92. SLO
  93. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 93

  94. Can developers define SLO?
  95. Availability class
  96. Availability class
    • We customize the production readiness check as availability class (a.k.a. production readiness review [9])
    [9] Google - Site Reliability Engineering, Chapter 32 - The Evolving SRE Engagement Model
  97. Availability class presets
    • Baseline
    • Medium
    • High
    • No SLO
  98. Baseline availability class
    Availability Target: > 95%
    Period   Downtime Budget
    Daily    1h 12m
    Weekly   8h 24m
    Monthly  36h 31m
  99. Medium availability class
    Availability Target: > 99%
    Period   Downtime Budget
    Daily    14m 24s
    Weekly   1h 41m
    Monthly  7h 18m
  100. High availability class
    Availability Target: > 99.9%
    Period   Downtime Budget
    Daily    1m 26s
    Weekly   10m 4s
    Monthly  43m 49s
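The downtime budgets in the three availability classes follow from (1 - target) × period length. The sketch below reproduces the arithmetic. The month length of 365.25 / 12 days is an assumption inferred from the monthly figures on the slides, and the function names are illustrative.

```python
DAYS_PER_MONTH = 365.25 / 12  # assumption: average month, matches the slide's monthly rows
SECONDS_PER_DAY = 86400


def downtime_budget_seconds(target, period_days):
    """Allowed downtime in seconds for an availability target over a period."""
    return (1.0 - target) * period_days * SECONDS_PER_DAY


def format_duration(seconds):
    """Format seconds as 'Hh Mm Ss', truncating fractions of a second."""
    total = int(round(seconds, 6))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


def budget_table(target):
    """Daily, weekly and monthly downtime budgets for one availability class."""
    periods = {"Daily": 1, "Weekly": 7, "Monthly": DAYS_PER_MONTH}
    return {name: format_duration(downtime_budget_seconds(target, days))
            for name, days in periods.items()}
```

For example, `budget_table(0.999)` gives a weekly budget of about 10 minutes. The slides round some entries to the nearest minute, so a value such as 1h 40m 48s appears there as 1h 41m.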
  101. 101

  102. 102

  103. How do we know the service level?
  104. Alerting on SLO
  105. Implementing alerts on SLO
    There are several strategies to implement alerts on SLO [10]
    • Target Error Rate ≥ SLO Threshold
    • Increased Alert Window
    • Incrementing Alert Duration
    • Alert on Burn Rate
    • Multiple Burn Rate Alerts
    • Multiwindow, Multi-Burn-Rate Alerts
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  106. Implementing alerts on SLO
    There are several strategies to implement alerts on SLO [10]
    • Target Error Rate ≥ SLO Threshold
    • Increased Alert Window
    • Incrementing Alert Duration
    • Alert on Burn Rate
    • Multiple Burn Rate Alerts
    • Multiwindow, Multi-Burn-Rate Alerts
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  107. Burn rate
    Burn rate is how fast a service consumes the error budget of its SLO
  108. Burn rates and time to complete budget exhaustion [10]
    Burn rate  Error rate for 99.9% SLO  Time to exhaustion
    1          0.1%                      30 days
    2          0.2%                      15 days
    10         1%                        3 days
    1000       100%                      43 minutes
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
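The rows of this table can be derived from two small formulas: the error rate is the burn rate times the error budget (1 - SLO), and the time to exhaustion is the SLO period divided by the burn rate. The sketch below assumes a 30-day SLO period, as implied by the first row. The function names are illustrative.

```python
def error_rate_for_burn_rate(burn_rate, slo):
    """Error rate that corresponds to a burn rate for a given SLO."""
    return burn_rate * (1.0 - slo)


def time_to_exhaustion_days(burn_rate, period_days=30):
    """Days until the whole error budget is gone at a constant burn rate."""
    return period_days / burn_rate


def time_to_exhaustion_minutes(burn_rate, period_days=30):
    """Same as above, expressed in minutes."""
    return time_to_exhaustion_days(burn_rate, period_days) * 24 * 60
```

At a burn rate of 1000 every request fails against a 99.9% SLO, and the 30-day budget lasts 43.2 minutes, which the slide shows as 43 minutes.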
  109. Burn rates and time to complete budget exhaustion [10]
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  110. Multiwindow, Multi-Burn-Rate Alerts [10]
    • This approach provides high-precision alerts and reduces the number of false positives
    • Make the short window 1/12 the duration of the long window as the starting point
    Severity  Notification  Long window  Short window  Burn rate  Error budget consumed
    Critical  Pager         1 hour       5 minutes     14.4       2%
    Critical  Pager         6 hours      30 minutes    6          5%
    Warning   Chat, ticket  3 days       6 hours       1          10%
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  111. Multiwindow, Multi-Burn-Rate Alerts [10]
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  112. Multiwindow, Multi-Burn-Rate Alerts [10]
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  113. 113

  114. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 114

  115. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 115

  116. Implementing Prometheus configs in Jsonnet
    • Jsonnet [11] is a data templating language
    • Simple extension of JSON
    • Eliminate duplication with object-orientation
    [11] google/jsonnet: Jsonnet - The data templating language
  117. Prometheus config structure in Jsonnet
        $ tree prometheus-config
        prometheus-config
        ├── alertmanager.jsonnet
        ├── alertmanager_templates.jsonnet
        ├── lib
        │   ├── alert.libsonnet
        │   ├── alertmanager.libsonnet
        │   [...snip...]
        │   ├── traefik.libsonnet
        │   └── utils.libsonnet
        ├── platform.libsonnet
        ├── prometheus_rules.jsonnet
        ├── runbooks
        │   ├── alertmanager-down.md
        │   ├── blackbox-exporter-down.md
        │   [...snip...]
        │   └── ssh-probe-failed.md
        ├── services
        │   ├── service1.libsonnet
        │   └── service2.libsonnet
        └── services.libsonnet
  118. Alerting rule library for Traefik
        $ cat lib/traefik.libsonnet
        {
          [...snip...]
          traefik_backend_high_error_budget_burn_rate_alert: self.alert {
            name: 'TraefikBackendHighErrorBudgetBurnRate',
            summary: '[{{ $labels.backend }} in {{ $labels.environment }}] Traefik backend error budget burn rate is high',
            description: '[{{ $labels.backend }} in {{ $labels.environment }}] Immediate intervention is required to defend the Uptime SLO',
            expr: |||
              (
                environment_backend:traefik_backend_errors_per_request:ratio_rate1h{%(matchers)s} > (14.4*0.001)
                and
                environment_backend:traefik_backend_errors_per_request:ratio_rate5m{%(matchers)s} > (14.4*0.001)
              )
              or
              (
                environment_backend:traefik_backend_errors_per_request:ratio_rate6h{%(matchers)s} > (6*0.001)
                and
                environment_backend:traefik_backend_errors_per_request:ratio_rate30m{%(matchers)s} > (6*0.001)
              )
            ||| % self,
          },
          [...snip...]
        }
  119. Alerting config for service1
        $ cat services/service1.libsonnet
        local resque = import '../lib/resque.libsonnet';
        local service = import '../lib/service.libsonnet';
        local traefik = import '../lib/traefik.libsonnet';
        service {
          name: 'service1',
          slack_channel: 'service1-alerts',
          dashboard: 'https://grafana.example.com./d/service1',
          components+: [
            [...snip...]
            self.component('traefik') {
              alerts+: [
                self.traefik_backend_high_error_budget_burn_rate_alert {
                  matchers: 'backend="service1", environment="production"',
                },
                self.traefik_backend_high_error_budget_burn_rate_warning_alert {
                  matchers: 'backend="service1", environment="production"',
                },
              ],
            } + traefik,
            [...snip...]
          ],
        }
  120. Caution!
    • Jsonnet is a super powerful language for eliminating redundancy
    • Overly DRY configurations are difficult to maintain
    • We have to control the power and keep configurations simple
  121. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 121

  122. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 122

  123. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  124. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document + ?
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  125. 125

  126. Strategy to make the new team free from on-call pressure
  127. Fallback to search-v1 when circuit breaker is open
    • Proxy partial requests to search-v2 in feature toggle
    • Strict circuit breaking threshold (No SLO or extremely low SLO) and fail fast when upstream is unstable
    • Rescue all errors in feature toggle
    • Fallback all requests to search-v1 when circuit breaker returns 503s
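The fallback flow above can be sketched as a feature toggle that sends a share of requests to search-v2 and rescues every error by falling back to search-v1. This is only an illustration: the function names, the CRC-based bucketing, and the callable interfaces are assumptions, not Cookpad's implementation.

```python
import zlib


def in_rollout(user_id, percent):
    """Stable bucketing: the same user always lands in the same bucket (0-99)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % 100 < percent


def search(query, user_id, search_v2, search_v1, rollout_percent):
    """Try search-v2 for users inside the rollout, otherwise use search-v1.

    search_v2 and search_v1 are callables that take a query and return results.
    """
    if in_rollout(user_id, rollout_percent):
        try:
            return search_v2(query)
        except Exception:
            # Rescue every error, including a 503 from an open circuit breaker,
            # and fall through to the stable implementation.
            pass
    return search_v1(query)
```

Because every failure of search-v2 degrades to search-v1 instead of surfacing to users, an outage of the new service does not need to page anyone.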
  128. 128

  129. 129

  130. 130

  131. 131

  132. On-call is not necessary in the new team
  133. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document
  134. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document + ?
  135. Implementation pattern
    • API Gateway (BFF) for mobile apps with JWT
    • Feature toggle + path-based routing
  136. BFF pa&ern for mobile clients in Cookpad 12 12 Cookpad

    Developers' Blog, ϞμϯBFFΛ׆༻ͨ͠طଘAPIαʔόʔͷ࠶ߏங SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 136
  137. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 137

  138. Prod endpoint + feature toggle + path-based routing
    • Specify shared single ML API endpoint in feature toggle
    • Strict circuit breaking threshold (No SLO) and fail fast when upstream is unstable
    • Rescue all errors in feature toggle and dismiss
    • Change destination for each ML API based on request path
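The combination above can be sketched as a routing table keyed by path prefix plus a caller that dismisses every error. In the deck the routing lives in a proxy behind a single shared endpoint and the error handling lives in the feature toggle, so this sketch merges the two for brevity. The route prefixes, hostnames, and function names are hypothetical.

```python
# Hypothetical path prefixes and upstreams, for illustration only.
ML_ROUTES = {
    "/image-enhancement": "http://image-enhancement.internal.example",
    "/image-to-recipe": "http://image-to-recipe.internal.example",
}


def resolve_upstream(path, routes=ML_ROUTES):
    """Pick the upstream whose prefix matches the request path (longest prefix wins)."""
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    return routes[max(matches, key=len)]


def call_ml_api(path, send, enabled=True, routes=ML_ROUTES):
    """Call an experimental ML API and return None on any failure.

    send is a callable that takes a full URL and returns a response.
    """
    if not enabled:
        return None
    upstream = resolve_upstream(path, routes)
    if upstream is None:
        return None
    try:
        return send(upstream + path)
    except Exception:
        # Experimental features are optional: dismiss the error so the caller
        # can render the page without the ML result.
        return None
```

Adding a new ML API then only needs a new route entry, while callers keep using the same shared endpoint and the same error handling.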
  139. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 139

  140. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 140

  141. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 141

  142. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document, SLO (No SLO), Circuit breaker, Feature toggle + Path-based routing
  143. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document, SLO (No SLO), Circuit breaker, Feature toggle + Path-based routing
  144. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 144

  145. Recap
  146. SRE expertise and circuit breaker
    • Protect microservices from unreliable microservices
    • Enforce contracts (alignment) among teams
    • Provide on-call free environment for new team
    • Enable developers to release experimental features
    • Reduce unproductive communication among teams
  147. Bonus talk
  148. What is the best on-call rotation?
    • It really depends on your team members
    • Someone loves weekly rotation
    • Someone loves daily rotation
    • Someone loves on-call on weekends
    • Don't create an organization-wide rotation rule [13]
    [13] A well designed policy about on-call compensation is necessary to achieve this
  149. On-call rotation strategy in Cookpad
    • Don't page on events which don't damage our SLO
    • Use advantages of time-zone differences and distributed team [14]
    • SREs and developers collaborate closely to fix problems
    [14] Strategy for two-tier on-call rotation, https://blog.takanabe.tokyo
  150. On-call rotation in Ruby backend team
  151. On-call rotation in SRE team
    • Hybrid strategy to use advantages of time-zone differences
    • JP (UTC+9) & UK (UTC+0) business hour shift
    • Daily off-hours rotation
  152. + Incident evacuation drill (≠ Chaos engineering)
  153. SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe 153

  154. How can we introduce SRE in an organization?
    If you try to introduce the SRE methodology and culture with bottom-up approaches,
    • Start from a small thing
    • Find your buddy from product development teams who is happy to support your ideas
    • Provide incentive to your product developers
    • SREs are responsible for primary on-call if your services achieve your SLO standard (e.g. 99.99% availability) for a month
    • Find win-win strategy for developers and SREs
    • Don't throw SRE sales pitch
    • Don't play "SRE is one of the Google best practices" cards
    • We should seriously provide benefits to organization with SRE methodologies (Why do we need SLO? What benefits do we have?)
  155. Achievements
    • Improvement of production stability
    • Apply SRE technique to real service
    • Release of machine learning integrated search in production [15]
    • Release of machine learning oriented infrastructure
    [15] Vector scoring for term embeddings in Elasticsearch - Speaker Deck
  156. What's next?
    • Promote SRE culture with battle-tested methodologies
    • Providing JWT auth endpoint for ML and other microservices
    • Machine learning researchers want to provide services that will be consumed by beta builds of mobile applications
    • Monolith doesn't need frequent code changes for ML experiences
    • Monolith doesn't have to proxy anything (this sounds worry
  157. Thank you