Software design and development in Cloud Native Era
• Container orchestrators: Why ECS? Why EKS (k8s)?
• Explanation of fundamental SRE terms (e.g., SLO, SLI, error budget)
SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe

TechConf 2019, Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
1 Cookpad TechConf 2018, Challenges for Global Service from a Perspective of SRE

favorite recipes for cooking
  • (e.g.) Personalized search, visual search, recommendations
• ML APIs: Other APIs can provide machine learning integrated features
  • (e.g.) Image enhancement, image to recipe

are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month." 4

"For production on-call responsibilities, I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people" 5

4 Google - Site Reliability Engineering, Chapter 11 - Being On-Call
5 Larson, Will. An Elegant Puzzle: Systems of Engineering Management, 2.1 Sizing teams (p.33)

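The sizing rule in both quotes is simple arithmetic, sketched below. The function name and its parameters are hypothetical, and "one week every month" is assumed to mean one week-long shift in every four-week cycle.

```python
import math


def min_on_call_engineers(concurrent_roles: int, shift_weeks: int, cycle_weeks: int) -> int:
    """Smallest team in which each engineer takes exactly one shift per cycle."""
    shifts_per_cycle = math.ceil(cycle_weeks / shift_weeks)
    return concurrent_roles * shifts_per_cycle


# Two roles (primary and secondary), week-long shifts, one shift every four weeks.
team_size = min_on_call_engineers(concurrent_roles=2, shift_weeks=1, cycle_weeks=4)
print(team_size)  # prints 8
```

A team smaller than this either shortens the gap between shifts or drops the secondary role.
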
new microservices
To-Be
New search/ML team must be free from on-call pressures for their new microservices

restricted technology stack | Search/ML team can use mainstream technology stack for their fields
Production outages due to new microservices | No production outages due to low service level microservices
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production

• At Cookpad, only the SRE team knows the entire system design 7
6 Google, Site Reliability Engineering, Chapter 31 - Communication and Collaboration in SRE
7 Google, The Site Reliability Workbook, Chapter 7 - Simplicity

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document + ?
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

2019)
• Follow the boundary we define in the design document
• Don't share servers managed by SRE team
• Use SaaS to accelerate minimum product development cycles 8
  • e.g., CI
  • e.g., Observability
8 Practical Monitoring: Effective Strategies for the Real World, Chapter 2.3 Pattern #3: Buy, Not Build

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document + ?
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

failures
• Limits external service and network impacts
• Don’t waste capacity calling a broken service
  • External service is slow
  • External service is down
  • Network is unstable

every 100ms based on a 10s rolling average
• Open / Tripped
  • Fail fast - return 503 error
  • Stays in this state for 10s
• Recovering / Half Open
  • Ramp up traffic over 10s
  • Check health every 100ms -> if fail go back to Open state
  • Return to Closed if health is OK after 10s

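The state transitions above can be sketched as a small state machine. This is a minimal illustration, not the proxy's actual implementation: the class, method names, and the injected `now` and `healthy` arguments are assumptions, and the health signal (computed from the rolling average) is supplied by the caller.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, open_seconds: float = 10.0, recovery_seconds: float = 10.0):
        self.state = State.CLOSED
        self.state_since = 0.0
        self.open_seconds = open_seconds
        self.recovery_seconds = recovery_seconds

    def _enter(self, state: State, now: float) -> None:
        self.state = state
        self.state_since = now

    def tick(self, now: float, healthy: bool) -> None:
        """Periodic health check (the slide uses a 100 ms interval)."""
        elapsed = now - self.state_since
        if self.state is State.CLOSED:
            if not healthy:
                self._enter(State.OPEN, now)       # trip
        elif self.state is State.OPEN:
            if elapsed >= self.open_seconds:
                self._enter(State.HALF_OPEN, now)  # start recovering
        elif not healthy:
            self._enter(State.OPEN, now)           # failed during recovery
        elif elapsed >= self.recovery_seconds:
            self._enter(State.CLOSED, now)         # fully recovered

    def allowed_fraction(self, now: float) -> float:
        """Share of traffic forwarded upstream; the rest fails fast with a 503."""
        if self.state is State.CLOSED:
            return 1.0
        if self.state is State.OPEN:
            return 0.0
        return min(1.0, (now - self.state_since) / self.recovery_seconds)


breaker = CircuitBreaker()
breaker.tick(now=1.0, healthy=False)   # Closed -> Open
breaker.tick(now=11.0, healthy=True)   # Open -> Half Open after 10 s
print(breaker.state, breaker.allowed_fraction(now=16.0))
```

Passing the clock in as an argument keeps the sketch deterministic and easy to test; a real breaker would read a monotonic clock.
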
new services with less risk to other parts of the application
• Slow responses ~= Outage!
• Fallback strategies become more important
• Adds value to SLOs as a communication tool about service levels

proxy with circuit breaking middleware
• Each microservice has its own independently configured circuit breaker
• Run as a sidecar container

errors connecting to the service
  • Shedding load can help some errors to recover!
• ResponseCodeRatio
  • Don’t bother calling a broken service
• LatencyAtQuantileMS
  • Isolate slow services.

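To make the trigger metrics concrete, here is a rough sketch of how each one could be computed over a window of recent requests. The `Sample` type, the function names, and the argument layout are assumptions for illustration; they are not the middleware's real API.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    status: Optional[int]  # None means the connection itself failed
    latency_ms: float


def network_error_ratio(samples: Sequence[Sample]) -> float:
    """Share of requests that never got a response."""
    return sum(s.status is None for s in samples) / len(samples)


def response_code_ratio(samples: Sequence[Sample], lo: int, hi: int,
                        total_lo: int, total_hi: int) -> float:
    """Share of statuses in [lo, hi) among statuses in [total_lo, total_hi)."""
    total = [s for s in samples if s.status is not None and total_lo <= s.status < total_hi]
    if not total:
        return 0.0
    return sum(lo <= s.status < hi for s in total) / len(total)


def latency_at_quantile_ms(samples: Sequence[Sample], quantile: float) -> float:
    """Nearest-rank latency at the given percentile (0-100)."""
    ordered = sorted(s.latency_ms for s in samples)
    rank = max(1, math.ceil(quantile * len(ordered) / 100))
    return ordered[rank - 1]


statuses = [200] * 6 + [500] * 2 + [None] * 2
latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
window = [Sample(s, l) for s, l in zip(statuses, latencies)]

# Trip when any trigger condition holds; the thresholds here are made up.
tripped = (
    network_error_ratio(window) > 0.3
    or response_code_ratio(window, 500, 600, 0, 600) > 0.2
    or latency_at_quantile_ms(window, 50) > 100
)
print(tripped)
```
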
class (a.k.a. production readiness review 9)
9 Google - Site Reliability Engineering, Chapter 32 - The Evolving SRE Engagement Model

rate | Error rate for 99.9% SLO | Time to exhaustion
1 | 0.1% | 30 days
2 | 0.2% | 15 days
10 | 1% | 3 days
1000 | 100% | 43 minutes
10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs

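The rows of this table follow from two formulas: error rate = burn rate x (1 - SLO), and time to exhaustion = SLO period / burn rate. A short sketch, assuming a 30-day SLO period (the function names are illustrative):

```python
def error_rate(burn_rate: float, slo: float = 0.999) -> float:
    """Error rate that consumes the error budget at the given burn rate."""
    return min(1.0, burn_rate * (1.0 - slo))


def days_to_exhaustion(burn_rate: float, period_days: float = 30.0) -> float:
    """Days until the whole error budget is spent at a constant burn rate."""
    return period_days / burn_rate


for rate in (1, 2, 10, 1000):
    minutes = days_to_exhaustion(rate) * 24 * 60
    print(f"burn rate {rate:>4}: error rate {error_rate(rate):.1%}, "
          f"exhausted in {days_to_exhaustion(rate):.2f} days ({minutes:.0f} minutes)")
```
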
alerts and reduce the number of false positives
• Make the short window 1/12 the duration of the long window as the starting point

Severity | Notification | Long window | Short window | Burn rate | Error budget consumed
Critical | Pager | 1 hour | 5 minutes | 14.4 | 2%
Critical | Pager | 6 hours | 30 minutes | 6 | 5%
Warning | Chat, ticket | 3 days | 6 hours | 1 | 10%
10 Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs

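As a sketch of how the table could be evaluated: an alert fires only when both its long and short windows burn faster than the threshold, and the "error budget consumed" column is burn rate x long window / SLO period. The class and function names are hypothetical; a 30-day period and a 99.9% SLO are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    severity: str
    long_window_hours: float
    short_window_hours: float
    burn_rate: float


ALERTS = [
    BurnRateAlert("critical", 1, 5 / 60, 14.4),
    BurnRateAlert("critical", 6, 0.5, 6),
    BurnRateAlert("warning", 72, 6, 1),
]


def budget_consumed(alert: BurnRateAlert, period_hours: float = 30 * 24) -> float:
    """Fraction of the error budget spent when the long window is fully burned."""
    return alert.burn_rate * alert.long_window_hours / period_hours


def fires(alert: BurnRateAlert, long_error_rate: float,
          short_error_rate: float, slo: float = 0.999) -> bool:
    """Both windows must exceed the threshold, which suppresses stale alerts."""
    threshold = alert.burn_rate * (1.0 - slo)
    return long_error_rate > threshold and short_error_rate > threshold


for alert in ALERTS:
    print(alert.severity, f"{budget_consumed(alert):.0%}")

# Errors seen over the last hour but already recovered in the last 5 minutes:
print(fires(ALERTS[0], long_error_rate=0.02, short_error_rate=0.001))
```

The short window is what lets the alert reset quickly once the error rate drops.
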
templating language
• Simple extension of JSON
• Eliminate duplication with object-orientation
11 google/jsonnet: Jsonnet - The data templating language

• Overly DRY configurations are difficult to maintain
• We have to control the power and keep configurations simple

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document + ?
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

partial requests to search-v2 in feature toggle
• Strict circuit breaking threshold (No SLO or extremely low SLO) and fail fast when upstream is unstable
• Rescue all errors in feature toggle
• Fall back all requests to search-v1 when circuit breaker returns 503s

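A minimal sketch of this toggle-plus-fallback pattern follows. All names are hypothetical: `search_v1` and `search_v2` stand in for clients of the two services, and the rollout bucket is derived from a CRC of the user ID purely for illustration.

```python
import zlib
from typing import Callable, List

SearchFn = Callable[[str], List[str]]


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets."""
    return zlib.crc32(user_id.encode("utf-8")) % 100 < percent


def search(query: str, user_id: str, rollout_percent: int,
           search_v1: SearchFn, search_v2: SearchFn) -> List[str]:
    if in_rollout(user_id, rollout_percent):
        try:
            return search_v2(query)
        except Exception:
            # Rescue every error, including the 503 the circuit breaker
            # returns while it is open, and fall through to search-v1.
            pass
    return search_v1(query)


def stable_v1(query: str) -> List[str]:
    return [f"v1:{query}"]


def broken_v2(query: str) -> List[str]:
    raise RuntimeError("503 Service Unavailable")


print(search("curry", "user-1", 100, stable_v1, broken_v2))  # prints ['v1:curry']
```

Because every failure of search-v2 degrades to search-v1, the new service does not need its own on-call rotation to protect users.
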
restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document

restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document + ?

shared single ML API endpoint in feature toggle
• Strict circuit breaking threshold (No SLO) and fail fast when upstream is unstable
• Rescue all errors in feature toggle and dismiss
• Change destination for each ML API based on request path

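Path-based routing behind a single endpoint, with errors rescued and dismissed, could look like the sketch below. The route table, host names, and the `send` callable are hypothetical placeholders rather than the actual configuration.

```python
from typing import Callable, Dict, Optional

# Hypothetical path prefixes and internal upstreams, one per ML API.
ROUTES: Dict[str, str] = {
    "/image-enhancement/": "http://image-enhancement.internal",
    "/image-to-recipe/": "http://image-to-recipe.internal",
}


def resolve_upstream(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Pick the upstream whose prefix matches the request path (longest wins)."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return None
    return routes[max(matches, key=len)]


def call_ml_api(path: str, send: Callable[[str, str], str]) -> Optional[str]:
    """Return the ML response, or None so the caller simply hides the feature."""
    upstream = resolve_upstream(path)
    if upstream is None:
        return None
    try:
        return send(upstream, path)
    except Exception:
        # No SLO for experimental ML APIs: rescue every error and dismiss it.
        return None


print(resolve_upstream("/image-to-recipe/v1/predict"))
```

Returning `None` instead of raising keeps an unstable experimental API from affecting the page that embeds it.
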
restricted technology stack | Search/ML team can use mainstream technology stack for their fields | Design document, Delegation and resource isolation
Production outages due to new microservices | No production outages due to low service level microservices | Design document, SLO, Circuit breaker
People have to be responsible for on-call rotations for new microservices | New search/ML team must be free from on-call pressures for their new microservices | Design document, SLO, Circuit breaker + Fallback
Many teams need tough negotiations to release ML related features | ML team can release experimental features with light process in production | Design document, SLO (No SLO), Circuit breaker, Feature toggle + Path-based routing

microservice
• Enforce contracts (alignment) among teams
• Provide on-call free environment for new team
• Enable developers to release experimental features
• Reduce unproductive communication among teams

on your team members
  • Someone loves weekly rotation
  • Someone loves daily rotation
  • Someone loves on-call on weekends
• Don't create an organization-wide rotation rule 13
13 A well-designed policy on on-call compensation is necessary to achieve this

to introduce the SRE methodology and culture with bottom-up approaches,
• Start from a small thing
• Find a buddy on the product development teams who is happy to support your ideas
• Provide incentives to your product developers
  • SREs are responsible for primary on-call if your services achieve your SLO standard (e.g., 99.99% availability) for a month
• Find a win-win strategy for developers and SREs
• Don't throw an SRE sales pitch
• Don't play the "SRE is one of the Google best practices" card
• We should seriously provide benefits to the organization with SRE methodologies (Why do we need SLOs? What benefits do we have?)

to real service
• Release of machine learning integrated search in production 15
• Release of machine learning oriented infrastructure
15 Vector scoring for term embeddings in Elasticsearch - Speaker Deck

Providing JWT auth endpoint for ML and other microservices
• Machine learning researchers want to provide services that will be consumed by beta builds of mobile applications
• Monolith doesn't need frequent code changes for ML experiences
• Monolith doesn't have to proxy anything (this sounds worry