SRE NEXT 2020 [C6] Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

The deck for the talk at SRE NEXT 2020 (https://sre-next.dev/schedule#c6)

Transcript

  1. SRE NEXT 2020
    Designing fault-tolerant microservices with
    SRE and circuit breaker centric architecture
    Takayuki Watanabe
    Cookpad Inc.
    SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe

  2. Who?
    Name: Takayuki Watanabe
    Affiliation: Cookpad Inc.
    Job: Site Reliability Engineering Chapter Lead
    SNS:
    Blog: blog.takanabe.tokyo
    GitHub: takanabe
    Twitter: @takanabe_w
    Interests:
    - Chaos Engineering
    - Distributed Systems
    - Resilience Engineering

  3. Menu
    • About Cookpad Global
    • Search-v2 and ML APIs
    • Gaps: ideal and reality
    • Designing fault-tolerant microservices with SRE and circuit
    breaker centric architecture

  4. Out of scope
    • Monolith vs SOA vs Microservices
    • Software design and development in Cloud Na[...]
    • Container orchestrators: Why ECS? Why EKS (k8s)?
    • Explana[...] budget)

  5. About Cookpad Global

  6. Cookpad Global by numbers
    • 42,700,000 monthly users
    • 3,160,000 recipes
    • 74 countries
    • 32 languages

  7. Cookpad Global by numbers
    • 1 monolith + 7 microservices in production
    • 300+ spot instances for ECS clusters
    • 400+ deployments per ECS task definition per day
    • 20 deployments to production per day

  8. Cookpad Global by numbers
    • 23 backend developers (Ruby:19, Python:4)
    • 5 Site Reliability Engineers

  9. See more details on Speaker Deck ... 1,2
    1 Cookpad TechConf 2018, Challenges for Global Service from a Perspective of SRE
    2 Cookpad TechConf 2019, Challenges for Global Service from a Perspective of SRE ~ 2nd season ~

  10. Go back to 2019...

  11. Make everyday cooking fun!

  12. Search is essential 3
    3 Go Global - #CookpadTechconf 2017

  13. Can users reach the best recipes out
    of 3,160,000 recipes?

  14. No...

  15. Search-v2 and ML APIs

  16. Search-v2 and ML APIs
    • Search-v2: helps people find their favorite recipes to cook
      • (e.g.) Personalized search, visual search, [...]
    • ML APIs: let other APIs provide features with machine learning built in
      • (e.g.) Image enhancement, image to recipe

  17. got it.

  18. So, who develops them?

  21. Machine learning researcher
    ≠ SWE in machine learning

  23. 4 search/machine learning
    integration engineers joined

  25. Everything goes smoothly!!

  27. Gaps: ideal and reality

  28. Gaps
    Organization & technology stack

  30. Microservice architecture
    = Each team can use any technology they want

  32. Have we finished decoupling the monolith into
    microservices?

  33. Do we have enough developers?

  34. Can we transfer internal resources and
    knowledge to other teams?

  35. We need more effort to gain the benefits of
    microservice architecture

  36. We restrict the technology stack we use

  39. Is it possible to develop search-v2/ML APIs
    with those tech stacks?

  40. Break barriers. Otherwise, no future

  41. As-Is
    Developers use a restricted technology stack
    To-Be
    Search/ML team can use the mainstream technology stack for their
    fields

  42. Gaps
    Expectations for service level

  45. This service is experimental
    This service is beta
    This service is a prototype
    This service is [ANY EXPRESSIONS]

  46. Low service level APIs potentially
    cause cascading outages

  48. As-Is
    Production is down due to outages of new microservices
    To-Be
    No production outages due to low service level microservices

  49. Gaps
    Team capacity

  51. Does the team have enough capacity for on-call?
    "Assuming that there are always two people on-call (primary and secondary, with
    different duties), the minimum number of engineers needed for on-call duty from a
    single-site team is eight: assuming week-long shifts, each engineer is on-call
    (primary or secondary) for one week every month." 4
    "For production on-call responsibilities, I’ve found that two-tier 24/7 support
    requires eight engineers. As teams holding their own pagers have become
    increasingly mainstream, this has become an important sizing constraint, and I try to
    ensure that every engineering team’s steady state is eight people" 5
    4 Google - Site Reliability Engineering, Chapter 11 - Being On-Call
    5 Larson, Will. An Elegant Puzzle: Systems of Engineering Management, 2.1 Sizing teams (p.33)
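The "eight engineers" figure in both quotes follows from simple arithmetic. A minimal sketch, assuming a month is treated as four one-week shifts (the function name and parameters are illustrative, not from the talk):

```python
def minimum_team_size(concurrent_roles: int, cycle_weeks: int, shift_weeks: int = 1) -> int:
    """Engineers needed so that each one holds a pager for only one shift per cycle."""
    shifts_per_cycle = cycle_weeks // shift_weeks
    return concurrent_roles * shifts_per_cycle


# Primary + secondary, week-long shifts, each engineer on-call one week in four.
print(minimum_team_size(concurrent_roles=2, cycle_weeks=4))
```

With two concurrent roles and a four-week cycle this yields eight, which is why a four-person search/ML team cannot sustain its own two-tier rotation.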

  52. As-Is
    People have to be responsible for on-call rotations for new
    microservices
    To-Be
    New search/ML team must be free from on-call pressures for their
    new microservices

  53. Gaps
    Knowledge for product development

  55. As-Is
    Many teams need tough negotiations to release ML related
    features
    To-Be
    ML team can release experimental features with light process in
    production

  56. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | -
    Production outages due to new microservices | No production outages due to low service level microservices | -
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | -
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | -

  57. Designing fault-tolerant microservices
    with SRE and circuit breaker centric
    architecture

  59. Design Docs
    • Reach consensus on scope and expectations 6
    • At Cookpad, only the SRE team knows the entire system design 7
    6 Google, Site Reliability Engineering, Chapter 31 - Communication and Collaboration in SRE
    7 Google, The Site Reliability Workbook, Chapter 7 - Simplicity

  62. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  63. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document + ?
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  64. Approach

  65. Delegation and resource isolation

  66. Resource isolation
    = AWS resource isolation

  68. Implementation pattern
    • IAM (delegation level: low)
    • IAM Permissions Boundary (delegation level: medium)
    • Dedicated AWS account (delegation level: high)

  69. Dedicated AWS account
    • Use AWS Organizations to issue a new AWS account
    • SRE designs the network
    • Build VPC peering between the new and old VPCs

  71. Search/ML team can use mainstream
    technology for their fields

  72. Transparent security and audit support
    • Enforce managed audit and security services on AWS
      • VPC Flow Logs
      • CloudTrail
      • GuardDuty
      • AWS Config

  73. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  74. Don't accept exceptions
    • We only have 3 SREs (in 2019)
    • Follow the boundary we define in the design document
    • Don't share servers managed by the SRE team
    • Use SaaS to accelerate minimum product development cycles 8
      • e.g.: CI
      • e.g.: Observability
    8 Practical Monitoring: Effective Strategies for the Real World, Chapter 2.3 Pattern #3: Buy, Not Build

  75. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document + ?
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  77. Disconnecting unstable production
    microservices makes sense

  78. Approach

  79. Circuit breaker centric architecture

  80. Why Circuit Breaker?
    • Fail fast strategy to prevent cascading failures
    • Limits external service and network impacts
    • Don’t waste capacity calling a broken service
    • External service is slow
    • External service is down
    • Network is unstable

  81. State transition diagram

  82. Case study

  84. • Closed
      • Traffic flows normally
      • Health is assessed every 100ms based on a 10s rolling average
    • Open / Tripped
      • Fail fast - return 503 error
      • Stays in this state for 10s
    • Recovering / Half Open
      • Ramp up traffic over 10s
      • Check health every 100ms -> if it fails, go back to the Open state
      • Return to Closed if health is OK after 10s
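The three states on this slide can be sketched as a small state machine. This is a toy illustration, not Traefik's implementation: the class and method names are hypothetical, the health signal is assumed to be computed elsewhere (on the slide it comes from a 10s rolling average), and the linear ramp-up is one possible reading of "ramp up traffic over 10s".

```python
import random
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    RECOVERING = "recovering"


class CircuitBreaker:
    """Toy three-state breaker; durations mirror the slide (10 s open, 10 s ramp-up)."""

    def __init__(self, open_seconds=10.0, recovery_seconds=10.0):
        self.open_seconds = open_seconds
        self.recovery_seconds = recovery_seconds
        self.state = State.CLOSED
        self.changed_at = 0.0

    def _move(self, state, now):
        self.state = state
        self.changed_at = now

    def tick(self, healthy, now):
        """Periodic health check (every 100 ms on the slide)."""
        elapsed = now - self.changed_at
        if self.state is State.CLOSED and not healthy:
            self._move(State.OPEN, now)
        elif self.state is State.OPEN and elapsed >= self.open_seconds:
            self._move(State.RECOVERING, now)
        elif self.state is State.RECOVERING:
            if not healthy:
                self._move(State.OPEN, now)
            elif elapsed >= self.recovery_seconds:
                self._move(State.CLOSED, now)

    def allow(self, now, rand=random.random):
        """Return True to forward the request, False to fail fast with a 503."""
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN:
            return False
        # Recovering: admit a linearly growing share of traffic.
        share = min(1.0, (now - self.changed_at) / self.recovery_seconds)
        return rand() < share
```

Passing `now` and `rand` explicitly keeps the sketch deterministic under test; a real proxy would read its own clock and metrics.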

  85. Circuit Breaker - Implications
    • We can introduce experimental and new services with less risk to
    other parts of the application
    • Slow responses ~= Outage!
    • Fallback strategies become more important
    • SLOs gain value as a communication tool about service levels

  86. Implementation pattern
    • Application library (e.g: cookpad/expeditor, Netflix/Hystrix)
    • Proxy (e.g: Envoy Proxy, Traefik)
    • Service Mesh (e.g: Istio, Maesh)

  87. Circuit breaker proxy side-car container
    • Use an L7 reverse proxy with circuit breaking middleware
    • Each microservice has its own independently configured circuit
    breaker
    • Run as a sidecar container

  88. Traefik as circuit breaker proxy
    • NetworkErrorRatio
      • Covers networking errors connecting to the service
      • Shedding load can help some errors to recover!
    • ResponseCodeRatio
      • Don’t bother calling a broken service
    • LatencyAtQuantileMS
      • Isolate slow services.

  89. Traefik configuration example
    {
      service1: {
        backend: 'http://service1_endpoint',
        circuit_breaker: "LatencyAtQuantileMS(50.0) > 1000 || ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10",
      },
      service2: {
        backend: 'http://service2_endpoint',
        circuit_breaker: "LatencyAtQuantileMS(50.0) > 3000 || ResponseCodeRatio(500, 600, 0, 600) > 0.10 || NetworkErrorRatio() > 0.10",
      },
    }
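To make the service1 expression concrete, here is a rough sketch of what each term measures, evaluated over a plain list of recent requests. The `Sample` type and helper names are hypothetical, the nearest-rank quantile is an assumption, and the real proxy computes these over its own sliding window.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    status: Optional[int]  # None models a network error (no HTTP response)
    latency_ms: float


def network_error_ratio(samples: List[Sample]) -> float:
    """Share of requests that never produced an HTTP response."""
    return sum(s.status is None for s in samples) / len(samples)


def response_code_ratio(samples: List[Sample], lo1: int, hi1: int, lo2: int, hi2: int) -> float:
    """Responses with a code in [lo1, hi1) divided by responses in [lo2, hi2)."""
    codes = [s.status for s in samples if s.status is not None]
    denominator = sum(lo2 <= c < hi2 for c in codes)
    numerator = sum(lo1 <= c < hi1 for c in codes)
    return numerator / denominator if denominator else 0.0


def latency_at_quantile_ms(samples: List[Sample], quantile: float) -> float:
    """Nearest-rank quantile of observed latencies (quantile given in percent)."""
    ordered = sorted(s.latency_ms for s in samples)
    rank = max(1, math.ceil(quantile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def service1_should_trip(samples: List[Sample]) -> bool:
    """Mirror of the service1 expression: slow median, many 5xx, or network errors."""
    return (
        latency_at_quantile_ms(samples, 50.0) > 1000
        or response_code_ratio(samples, 500, 600, 0, 600) > 0.30
        or network_error_ratio(samples) > 0.10
    )
```

Any one of the three conditions is enough to open the breaker, so a service that answers correctly but slowly is treated the same as one that is down.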

  91. How do we decide the thresholds?

  92. SLO

  94. Can developers define SLO?

  95. Availability class

  96. Availability class
    • We customize the production readiness check as availability classes
    (a.k.a. production readiness review 9)
    9 Google - Site Reliability Engineering, Chapter 32 - The Evolving SRE Engagement Model

  97. Availability class presets
    • Baseline
    • Medium
    • High
    • No SLO

  98. Baseline availability class
    Availability target: > 95%
    Period | Downtime budget
    Daily | 1h 12m
    Weekly | 8h 24m
    Monthly | 36h 31m

  99. Medium availability class
    Availability target: > 99%
    Period | Downtime budget
    Daily | 14m 24s
    Weekly | 1h 41m
    Monthly | 7h 18m

  100. High availability class
    Availability target: > 99.9%
    Period | Downtime budget
    Daily | 1m 26s
    Weekly | 10m 4s
    Monthly | 43m 49s
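The downtime budgets for the three availability classes can be reproduced with a few lines of arithmetic. A minimal sketch, assuming "monthly" means an average month of 365.25 / 12 days (an assumption that happens to match the slide figures); the names are illustrative.

```python
from datetime import timedelta

PERIODS = {
    "Daily": timedelta(days=1),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=365.25 / 12),  # average month (assumption)
}


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Time a service may be unavailable in `period` while still meeting the target."""
    return period * (1.0 - availability_target)


def format_budget(budget: timedelta) -> str:
    total = int(budget.total_seconds())  # drop the sub-second remainder
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes}m {seconds}s"


for target in (0.95, 0.99, 0.999):
    for name, period in PERIODS.items():
        print(f"{target:.1%} {name}: {format_budget(downtime_budget(target, period))}")
```

The slides round some entries to the nearest minute (for example 1h 40m 48s is shown as 1h 41m).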

  103. How do we know the service level?

  104. Alerting on SLO

  105. Implementing alerts on SLO
    There are several strategies to implement alerts on SLO 10
    • Target Error Rate ≥ SLO Threshold
    • Increased Alert Window
    • Incrementing Alert Duration
    • Alert on Burn Rate
    • Multiple Burn Rate Alerts
    • Multiwindow, Multi-Burn-Rate Alerts
    10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs

  106. Implementing alerts on SLO
    There are several strategies to implement alerts on SLO 10
    • Target Error Rate ≥ SLO Threshold
    • Increased Alert Window
    • Incrementing Alert Duration
    • Alert on Burn Rate
    • Multiple Burn Rate Alerts
    • Multiwindow, Multi-Burn-Rate Alerts
    10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs

  107. Burn rate
    Burn rate is how fast, relative to the SLO, a service consumes its error budget

  108. Burn rates and time to complete budget exhaustion 10
    Burn rate | Error rate for 99.9% SLO | Time to exhaustion
    1 | 0.1% | 30 days
    2 | 0.2% | 15 days
    10 | 1% | 3 days
    1000 | 100% | 43 minutes
    10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
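The rows of this table follow from two relations: the error rate is the burn rate times the error budget, and the time to exhaustion is the SLO period divided by the burn rate. A small sketch, assuming a 30-day SLO period as in the table (function names are illustrative):

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, as in the table


def error_rate_for_burn_rate(burn_rate: float, slo: float = 0.999) -> float:
    """Error rate that corresponds to a burn rate for a given SLO."""
    return burn_rate * (1.0 - slo)


def hours_to_exhaustion(burn_rate: float, period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Hours until the whole error budget is spent at a constant burn rate."""
    return period_hours / burn_rate


for rate in (1, 2, 10, 1000):
    print(rate, f"{error_rate_for_burn_rate(rate):.1%}", f"{hours_to_exhaustion(rate):.2f} h")
```

A burn rate of 1 spends the budget exactly at the end of the period; anything above 1 exhausts it early.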

  110. Multiwindow, Multi-Burn-Rate Alerts 10
    • This approach provides alerts with good precision and reduces the number of false positives
    • Make the short window 1/12 the duration of the long window as a starting point
    Severity | Notification | Long window | Short window | Burn rate | Error budget consumed
    Critical | Pager | 1 hour | 5 minutes | 14.4 | 2%
    Critical | Pager | 6 hours | 30 minutes | 6 | 5%
    Warning | Chat, ticket | 3 days | 6 hours | 1 | 10%
    10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
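The "Error budget consumed" column and the short windows in this table can be derived rather than memorized. A sketch under the assumption of a 30-day SLO period (names are illustrative):

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def budget_consumed(burn_rate: float, long_window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the error budget spent if `burn_rate` holds for the long window."""
    return burn_rate * long_window_hours / period_hours


def short_window_hours(long_window_hours: float) -> float:
    """Starting point from the slide: short window = 1/12 of the long window."""
    return long_window_hours / 12


for burn_rate, long_window in ((14.4, 1), (6, 6), (1, 72)):
    print(burn_rate, long_window,
          f"{budget_consumed(burn_rate, long_window):.0%}",
          f"{short_window_hours(long_window) * 60:.0f} min")
```

Read the other way around, the burn-rate thresholds are chosen so that a page fires once 2% or 5% of the monthly budget has gone within the long window.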

  116. Implementing Prometheus configs in Jsonnet
    • Jsonnet 11 is a data templating language
    • Simple extension of JSON
    • Eliminate duplication with object-orientation
    11 google/jsonnet: Jsonnet - The data templating language

  117. Prometheus config structure in Jsonnet
    $ tree prometheus-config
    prometheus-config
    ├── alertmanager.jsonnet
    ├── alertmanager_templates.jsonnet
    ├── lib
    │ ├── alert.libsonnet
    │ ├── alertmanager.libsonnet
    │ [...snip...]
    │ ├── traefik.libsonnet
    │ └── utils.libsonnet
    ├── platform.libsonnet
    ├── prometheus_rules.jsonnet
    ├── runbooks
    │ ├── alertmanager-down.md
    │ ├── blackbox-exporter-down.md
    │ [...snip...]
    │ └── ssh-probe-failed.md
    ├── services
    │ ├── service1.libsonnet
    │ └── service2.libsonnet
    └── services.libsonnet

  118. Alerting rule library for Traefik
    $ cat lib/traefik.libsonnet
    {
      [...snip...]
      traefik_backend_high_error_budget_burn_rate_alert: self.alert {
        name: 'TraefikBackendHighErrorBudgetBurnRate',
        summary: '[{{ $labels.backend }} in {{ $labels.environment }}] Traefik backend error budget burn rate is high',
        description: '[{{ $labels.backend }} in {{ $labels.environment }}] Immediate intervention is required to defend the Uptime SLO',
        expr: |||
          (
            environment_backend:traefik_backend_errors_per_request:ratio_rate1h{%(matchers)s} > (14.4*0.001)
            and
            environment_backend:traefik_backend_errors_per_request:ratio_rate5m{%(matchers)s} > (14.4*0.001)
          )
          or
          (
            environment_backend:traefik_backend_errors_per_request:ratio_rate6h{%(matchers)s} > (6*0.001)
            and
            environment_backend:traefik_backend_errors_per_request:ratio_rate30m{%(matchers)s} > (6*0.001)
          )
        ||| % self,
      },
      [...snip...]
    }
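The `expr` above pairs each long window with a short one, so the alert stops firing soon after the errors stop. The same boolean logic, sketched over already computed error ratios (the dictionary keys and function name are hypothetical; 0.001 is the error budget of a 99.9% SLO):

```python
from typing import Dict


def page_worthy(ratios: Dict[str, float], slo: float = 0.999) -> bool:
    """Evaluate the two critical multiwindow conditions from the alerting rule."""
    budget = 1.0 - slo  # 0.001 for a 99.9% SLO
    fast_burn = ratios["1h"] > 14.4 * budget and ratios["5m"] > 14.4 * budget
    slow_burn = ratios["6h"] > 6 * budget and ratios["30m"] > 6 * budget
    return fast_burn or slow_burn


# Still burning fast: both the 1h and the 5m windows are above 1.44%.
print(page_worthy({"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.0}))
# Recovered: the 1h window is still high, but the 5m window has dropped.
print(page_worthy({"1h": 0.02, "5m": 0.001, "6h": 0.004, "30m": 0.02}))
```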

  119. Alerting config for service1
    $ cat services/service1.libsonnet
    local resque = import '../lib/resque.libsonnet';
    local service = import '../lib/service.libsonnet';
    local traefik = import '../lib/traefik.libsonnet';
    service {
      name: 'service1',
      slack_channel: 'service1-alerts',
      dashboard: 'https://grafana.example.com./d/service1',
      components+: [
        [...snip...]
        self.component('traefik') {
          alerts+: [
            self.traefik_backend_high_error_budget_burn_rate_alert {
              matchers: 'backend="service1", environment="production"',
            },
            self.traefik_backend_high_error_budget_burn_rate_warning_alert {
              matchers: 'backend="service1", environment="production"',
            },
          ],
        } + traefik,
        [...snip...]
      ],
    }

  120. Caution!
    • Jsonnet is a super powerful language for eliminating redundancy
    • Too DRYed configura[...]
    • We have to control the power and make configura[...]

  123. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document; SLO; Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  124. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document; SLO; Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document + ?
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  126. Strategy to make new team free
    from on-call pressure

  127. Fall back to search-v1 when the circuit breaker is open
    • Proxy a portion of requests to search-v2 via a feature toggle
    • Strict circuit breaking threshold (No SLO or extremely low SLO)
    and fail fast when upstream is unstable
    • Rescue all errors in the feature toggle
    • Fall back all requests to search-v1 when the circuit breaker returns
    503s
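The fallback flow on this slide can be sketched in a few lines. This is an illustration, not Cookpad's code: `search`, `rollout_ratio` and the injected `search_v1` / `search_v2` callables are hypothetical, and the circuit breaker's 503 is modeled as an exception raised by the search-v2 client.

```python
import random


class UpstreamUnavailable(Exception):
    """Stands in for a 503 returned by the circuit breaker proxy."""


def search(query, search_v1, search_v2, rollout_ratio=0.1, rand=random.random):
    """Send a share of traffic to search-v2; any failure falls back to search-v1."""
    if rand() < rollout_ratio:  # feature toggle: partial rollout
        try:
            return search_v2(query)
        except Exception:
            # Rescue all errors, including the breaker's fail-fast 503.
            pass
    return search_v1(query)
```

Because every error path ends in search-v1, an outage of search-v2 degrades result quality rather than availability, which is what removes the paging pressure from the new team.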

  132. On-call is not necessary for the new team

  133. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document; SLO; Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document; SLO; Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

  134. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document; SLO; Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document; SLO; Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document + ?

  135. Implementation pattern
    • API Gateway (BFF) for mobile apps with JWT
    • Feature toggle + path-based routing

  136. BFF pattern for mobile clients in Cookpad 12
    12 Cookpad Developers' Blog, Rebuilding an existing API server with a modern BFF (モダンBFFを活用した既存APIサーバーの再構築)

  138. Prod endpoint + feature toggle + path-based routing
    • Specify a single shared ML API endpoint in the feature toggle
    • Strict circuit breaking threshold (No SLO) and fail fast when
    upstream is unstable
    • Rescue all errors in the feature toggle and dismiss them
    • Change the destination for each ML API based on the request path
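A rough sketch of the routing idea: one shared endpoint, with the destination chosen from the request path, and a caller that dismisses every error. The path prefixes, upstream names and the `send` callable are hypothetical placeholders, not values from the talk.

```python
ROUTES = {  # hypothetical path prefixes and upstream names
    "/image_enhancement": "image-enhancement-service",
    "/image_to_recipe": "image-to-recipe-service",
}


def resolve_upstream(path):
    """Pick the ML API behind the shared endpoint from the request path."""
    for prefix, upstream in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return None


def call_ml_api(path, send, enabled=True):
    """Feature-toggle wrapper: errors are rescued and dismissed, so the caller sees None."""
    if not enabled:
        return None
    upstream = resolve_upstream(path)
    if upstream is None:
        return None
    try:
        return send(upstream, path)
    except Exception:
        return None
```

Since the monolith only knows the shared endpoint, adding a new experimental ML API means adding a route, not changing the caller.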

  142. Goals for the SRE team
    As-Is | To-Be | Approach
    Developers use restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document; Delegation and resource isolation
    Production outages due to new microservices | No production outages due to low service level microservices | Design document; SLO; Circuit breaker
    People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document; SLO; Circuit breaker + Fallback
    Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document; SLO (No SLO); Circuit breaker; Feature toggle + Path-based routing

  145. Recap

  146. SRE expertise and circuit breaker
    • Protect microservices from unreliable microservices
    • Enforce contracts (alignment) among teams
    • Provide an on-call-free environment for the new team
    • Enable developers to release experimental features
    • Reduce unproductive communication among teams

  147. Bonus talk

  148. What is the best on-call rotation?
    • It really depends on your team members
      • Some love weekly rotation
      • Some love daily rotation
      • Some love on-call on weekends
    • Don't create an organization-wide rotation rule 13
    13 A well-designed policy on on-call compensation is necessary to achieve this

  149. On-call rotation strategy in Cookpad
    • Don't page on events which don't damage our SLO
    • Use the advantages of time-zone differences and a distributed team 14
    • SREs and developers collaborate closely to fix problems
    14 Strategy for two-tier on-call rotation, https://blog.takanabe.tokyo

  150. On-call rotation in Ruby backend team

  151. On-call rotation in SRE team
    • Hybrid strategy to use advantages of time-zone differences
      • JP (UTC+9) & UK (UTC+0) business hour shift
      • Daily off-hours rotation

  152. + Incident evacuation drill
    (≠ Chaos engineering)

  154. How can we introduce SRE in an organization?
    If you try to introduce the SRE methodology and culture with bottom-up approaches,
    • Start from a small thing
    • Find a buddy on the product development teams who is happy to support your ideas
    • Provide incentives to your product developers
      • SREs are responsible for primary on-call if your services achieve your SLO standard (e.g: 99.99% availability) for a
      month
    • Find a win-win strategy for developers and SREs
    • Don't throw an SRE sales pitch
      • Don't play the "SRE is one of the Google best practices" card
      • We should seriously provide benefits to the organization with SRE methodologies (Why do we need SLO? What benefits
      do we have?)

  155. Achievements
    • Improvement of production stability
    • Apply SRE techniques to a real service
    • Release of machine learning integrated search in production 15
    • Release of machine learning oriented infrastructure
    15 Vector scoring for term embeddings in Elasticsearch - Speaker Deck

  156. What's next?
    • Promote SRE culture with battle-tested methodologies
    • Providing JWT auth endpoint for ML and other microservices
      • Machine learning researchers want to provide services that
      will be consumed by beta builds of mobile applications
      • Monolith doesn't need frequent code changes for ML
      experiences
      • Monolith doesn't have to proxy anything (this sounds worry[...]

  157. Thank you