
SRE Practices in Mercari Microservices

Taichi Nakashima
January 25, 2020

SRE Practices in Mercari Microservices

These are the slides for SRE NEXT 2020 (https://sre-next.dev/).

The Mercari Microservices Platform Team builds a platform on which backend developers build and run microservices. Currently, around 100+ microservices run on this platform and more than 200 developers work with it. At this scale, reliability is critical. In this talk, I share how we operate the platform by applying Google's SRE practices: how we set and "update" SLI/SLO, how we ensure observability, how we prepare playbooks, and so on.

The referenced books, talks, and links are at https://gist.github.com/tcnksm/cc7ce8d7edc5b31a4710633574664c61


Transcript

  1. SRE Practices in Mercari Microservices
    SRE NEXT 2020


  2. Taichi Nakashima
    @deeeet / @tcnksm


  3. https://blog.eventuate.io/2017/01/04/the-microservice-architecture-is-a-means-to-an-end-enabling-continuous-deliverydeployment/


  4. Platform Team and SRE Team in Microservices
    (Diagram: the Platform Team owns the Platform; the Service A, B, and C Teams own Services A, B, and C; Mercari SRE and Merpay SRE work closely with, or are embedded in, the service teams.)


  5. 150+
    Microservices
    3000+
    Kubernetes Pods
    10
    Platformers


  6. Reliability in Microservices
    (Diagram: the Platform Team is responsible for the reliability of the platform; the Service A, B, and C Teams, with Mercari SRE and Merpay SRE working closely or embedded, are responsible for the reliability of their microservices.)


  7. SRE Practices for Platform


  8. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  9. Agility vs. Stability
    How do you incentivize reliability?
    https://cre.page.link/art-of-slos-slides


  10. Prerequisite of SLI/SLO: Project Management
    (Diagram: each sprint team runs sprints and sprint reviews from its sprint backlog; the SLO and the backlogs feed into the priority discussion.)


  11. Principles of SLI/SLO

    ● Focus on customers (= internal developers)
    ● Don’t try to be perfect from the beginning


  12. SLO Document (Template)

    ● Header
    ○ Author(s) & Approver(s)
    ○ Approval date & Revisit date
    ● Service Overview
    ● SLI and SLO
    ○ Category (availability, latency, quality, freshness, correctness, durability)
    ○ Time window (daily, weekly, monthly, quarterly)
    ○ SLI specification (what) & implementation (how)
    ○ SLO
    ● Rationale
    https://landing.google.com/sre/workbook/chapters/slo-document/
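
    The bullet points above can also be read as a small data model. The following Go sketch is purely illustrative and hypothetical (the deck shows no code, and these field names are not Mercari's); it only mirrors the template fields.

    // Hypothetical sketch of the SLO document template above as a Go struct.
    // Field names are illustrative; the actual SLO document is prose, not code.
    package slodoc

    import "time"

    type Objective struct {
        Category          string // availability, latency, quality, freshness, correctness, durability
        TimeWindow        string // daily, weekly, monthly, quarterly
        SLISpecification  string // what is measured
        SLIImplementation string // how it is measured (e.g., a Datadog metric)
        SLO               string // the target, e.g., "99.5% of executions finish successfully"
    }

    type SLODocument struct {
        // Header
        Authors      []string
        Approvers    []string
        ApprovalDate time.Time
        RevisitDate  time.Time

        // Service Overview
        ServiceOverview string

        // SLI and SLO
        Objectives []Objective

        // Rationale
        Rationale string
    }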


  13. SLI/SLO for Spinnaker

    ● How do we define SLI/SLO?
    ● How do we update SLI/SLO?



  15. Spinnaker pipeline


  16. SLO for Spinnaker

    ● Pipeline execution success rate
    ● Pipeline execution duration


  17. SLO for Spinnaker

    Pipeline execution success rate
    ● Category: Availability
    ● Time window: Weekly
    ● SLI specification: The ratio of sample pipeline executions that finished successfully
    ● SLI implementation: The number of successful pipeline executions / total pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
    ● SLO: 99.5% of the pipeline executions finish successfully.


  18. SLO for Spinnaker

    Pipeline execution duration
    ● Category: Latency
    ● Time window: Weekly
    ● SLI specification: The duration of a pipeline execution that finished successfully is less than 5 minutes
    ● SLI implementation: The duration of successful pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
    ● SLO: 99.5% of the pipeline executions finish successfully within 5 minutes.
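
    For illustration only (this is not Mercari's implementation; the slides say these numbers come from the Datadog metric spinnaker.pipeline.duration), here is a minimal Go sketch of how the two SLIs above could be computed from a week of pipeline execution records.

    // Assumed sketch: computing the two Spinnaker SLIs from raw execution records.
    package main

    import (
        "fmt"
        "time"
    )

    type Execution struct {
        Succeeded bool
        Duration  time.Duration
    }

    // successRate: the "pipeline execution success rate" SLI
    // (successful executions / total executions).
    func successRate(execs []Execution) float64 {
        if len(execs) == 0 {
            return 1.0
        }
        ok := 0
        for _, e := range execs {
            if e.Succeeded {
                ok++
            }
        }
        return float64(ok) / float64(len(execs))
    }

    // withinTarget: the "pipeline execution duration" SLI
    // (executions finishing successfully within the target / total executions).
    func withinTarget(execs []Execution, target time.Duration) float64 {
        if len(execs) == 0 {
            return 1.0
        }
        ok := 0
        for _, e := range execs {
            if e.Succeeded && e.Duration <= target {
                ok++
            }
        }
        return float64(ok) / float64(len(execs))
    }

    func main() {
        week := []Execution{
            {Succeeded: true, Duration: 3 * time.Minute},
            {Succeeded: true, Duration: 7 * time.Minute},
            {Succeeded: false, Duration: 1 * time.Minute},
        }
        fmt.Printf("availability SLI: %.3f (SLO: 0.995)\n", successRate(week))
        fmt.Printf("latency SLI:      %.3f (SLO: 0.995)\n", withinTarget(week, 5*time.Minute))
    }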


  19. Spinnaker SLO Monitoring by Datadog


  20. How We Update SLI/SLO?

    SLO revisit meeting agenda
    ● Check whether we meet SLO
    ● Review incident reports and postmortems
    ● Review support issues
    ● Check customer satisfaction
    ○ Ego-searching on Slack
    ○ Check developer survey results


  21. How We Update SLI/SLO: Developer Survey


  22. How We Update SLI/SLO: SLO Decision Matrix
    https://landing.google.com/sre/workbook/chapters/implementing-slos/


  23. Examples of SLI/SLO Update

    ● Case 1
    ○ SLOs: Met, Toil: High, Customer satisfaction: Low
    ○ Tighten the SLO
    ● Case 2
    ○ SLOs: Met, Toil: Low, Customer satisfaction: Low
    ○ Change the SLI instead of tightening the SLO (*)
    * From the developer survey and ego-searching, we noticed the current SLI did not reflect customer usage
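
    The two cases above can be read as rows of the decision matrix. As a hypothetical Go sketch (the real decision is a team discussion guided by the SRE Workbook's matrix, not code):

    // Illustrative only: the two example cases above as a tiny decision helper.
    package main

    import "fmt"

    type Signals struct {
        SLOMet         bool
        ToilHigh       bool
        CustomersHappy bool
    }

    func nextAction(s Signals) string {
        switch {
        case s.SLOMet && s.ToilHigh && !s.CustomersHappy:
            return "tighten the SLO" // Case 1
        case s.SLOMet && !s.ToilHigh && !s.CustomersHappy:
            return "change the SLI instead of tightening the SLO" // Case 2
        default:
            return "consult the SLO decision matrix"
        }
    }

    func main() {
        fmt.Println(nextAction(Signals{SLOMet: true, ToilHigh: true}))  // Case 1
        fmt.Println(nextAction(Signals{SLOMet: true, ToilHigh: false})) // Case 2
    }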


  24. Example of SLI/SLO Update: Changing the Spinnaker SLI
    (Diagram: Before, the old SLI was measured from pipeline start to pipeline finish. After, the new SLI is measured from image push to pipeline finish.)


  25. How We Use SLO?
    (Diagram: while the SLO is OK, the sprint team continues normal releases. When the SLO is violated and the on-support team is capable, the on-support team handles it while the other members focus on their tasks; when it is hard to handle, the sprint team takes over.)



  26. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  27. Who Is On-call for a Microservice?
    (Diagram: each service team is on-call for its own service (Service A, B, C); the Platform Team is on-call for the platform. A boundary separates the services from the platform.)


  28. Why Is End-to-End Responsibility Important?
    (Diagram: a feedback loop of Develop → Fail → Analysis → Learn.)
    https://youtu.be/KLxwhsJuZ44


  29. Why Is End-to-End Responsibility Important?
    (Diagram: the same Develop → Fail → Analysis → Learn loop, with a boundary between the Dev team and the Ops team cutting across it.)
    https://youtu.be/KLxwhsJuZ44


  30. On-call Responsibility for Google Kubernetes Engine
    (Diagram: the Google SRE team is responsible for the Kubernetes master; the Platform Team is responsible for the Kubernetes nodes and the system namespaces X, Y, and Z; each service team is responsible for the pods in its own service namespace (A, B, C). Boundaries separate these layers of responsibility.)

  31. Principles of Alerting

    ● Alert by RED and investigate by USE
    ● Alerts must be actionable
    Design alerts assuming they will wake you up at 3 a.m. in the middle of your winter vacation


  32. RED vs. USE
    ● RED (Rate, Errors, Duration): top-level health of the system — whether the system is working from the customer's point of view
    ● USE (Utilization, Saturation, Errors): low-level health of the system — system resource status
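
    To make "alert by RED" concrete, here is a minimal, dependency-free Go sketch of RED instrumentation for an HTTP service. It is an illustration under assumed conditions; in the deck these signals actually come from Datadog, not from hand-rolled counters.

    // Minimal sketch of RED (Rate, Errors, Duration) instrumentation for an HTTP handler.
    package main

    import (
        "log"
        "net/http"
        "sync/atomic"
        "time"
    )

    type redMetrics struct {
        requests   atomic.Int64 // Rate: total requests
        errors     atomic.Int64 // Errors: responses with status >= 500
        durationMs atomic.Int64 // Duration: total handling time in ms, for an average
    }

    // statusRecorder captures the status code written by the wrapped handler.
    type statusRecorder struct {
        http.ResponseWriter
        status int
    }

    func (r *statusRecorder) WriteHeader(code int) {
        r.status = code
        r.ResponseWriter.WriteHeader(code)
    }

    func (m *redMetrics) middleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
            start := time.Now()
            rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
            next.ServeHTTP(rec, req)

            m.requests.Add(1)
            if rec.status >= 500 {
                m.errors.Add(1)
            }
            m.durationMs.Add(time.Since(start).Milliseconds())
        })
    }

    func main() {
        var m redMetrics
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", m.middleware(mux)))
    }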


  33. Actionable Alert

    ● Prepare a playbook for each alert
    ● Have team members review the alert and its playbook


  34. Playbook (Template)

    ● Links
    ○ SLO document
    ○ Datadog timeboard & screenboard
    ○ GitHub repository
    ● Alerts
    ○ Link to the Datadog monitor
    ○ What (What does the alert indicate?)
    ○ Why (Why is this alert required?)
    ○ Investigation
    ○ Mitigation


  35. Playbook (Vault)

    Vault pods are sealed!
    ● What: This alert detects that Vault pods are sealed unexpectedly
    ● Why: An unexpected Vault seal means something is wrong
    ● Investigation: (kubectl and vault commands for investigation)
    ● Mitigation: (kubectl and vault commands for mitigation)


  36. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  37. Reactive Tasks

    ● Toil (e.g., running manual scripts to maintain platform components)
    ● Developer support (e.g., answering questions)
    ● Bug fixes (e.g., fixing CI automation scripts)
    ● Security fixes (e.g., upgrading Kubernetes clusters)


  38. Principles of Reactive Tasks

    We can NOT completely remove reactive tasks; they are always there as you grow. Instead, think about how to live with them better.


  39. "At least 50% of each SRE’s time should be spent on engineering project
    work that will either reduce future toil or add service features."
    https://landing.google.com/sre/sre-book/chapters/eliminating-toil/


  40. How We Evolve On-support?
    (Diagram: on-support evolves from an individual within the sprint teams, to a rotation across the sprint teams, to a dedicated on-support team with its own rotation.)


  41. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  42. SRE Practices for Microservices


  43. How We Ensure Microservice Reliability?
    (Diagram: the Platform Team runs the platform for 150+ microservices, which are owned by service teams such as the Service A, B, and C Teams, with Mercari SRE and Merpay SRE working closely or embedded.)


  44. Principles of Microservices Reliability

    Bring the platform team's practices to the development teams and implement them as self-service capabilities


  45. Design Doc / Production Readiness Check


  46. Microservices Design Doc (Template)

    ● Summary
    ● Background
    ○ Goals
    ○ Non-goals
    ● System Design
    ○ Interfaces
    ○ Traffic migration
    ○ Expected clients & dependencies
    ○ SLI/SLO
    ○ Databases
    ○ Security considerations


  47. Production Readiness Check

    ● Maintainability (e.g., unit tests, test coverage, automated build, ...)
    ● Observability (e.g., Datadog timeboard & screenboard, profiling, logging)
    ● Reliability (e.g., autoscaling, graceful shutdown, PDB, ...)
    ● Security (e.g., security team review, non-sensitive logs, ...)
    ● Accessibility (e.g., design doc, API doc, ...)
    ● Data Storage (e.g., data replication, backup and recovery, ...)
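
    As a hedged illustration (the real check is a review process and a document, not code), the categories above could be tracked as a simple checklist. The item names below come from the slide; the structure itself is assumed.

    // Illustrative sketch: tracking production readiness categories as a checklist.
    package main

    import "fmt"

    type Check struct {
        Category string
        Item     string
        Done     bool
    }

    // unmet returns the checks that still block a production launch.
    func unmet(checks []Check) []Check {
        var out []Check
        for _, c := range checks {
            if !c.Done {
                out = append(out, c)
            }
        }
        return out
    }

    func main() {
        checks := []Check{
            {"Maintainability", "Unit tests, test coverage, automated build", true},
            {"Observability", "Datadog timeboard & screenboard, profiling, logging", true},
            {"Reliability", "Autoscaling, graceful shutdown, PDB", false},
            {"Security", "Security team review, non-sensitive logs", false},
            {"Accessibility", "Design doc, API doc", true},
            {"Data Storage", "Data replication, backup and recovery", true},
        }
        for _, c := range unmet(checks) {
            fmt.Printf("NOT READY: [%s] %s\n", c.Category, c.Item)
        }
    }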


  48. Production Readiness Check


  49. Design Doc / Production Readiness Check


  50. Key Takeaways

    ● SRE practices for the Platform
    ○ Don't try to have perfect SLI/SLO; improve them continuously
    ○ Design alerts so that you can be woken up at 3 a.m.
    ○ Think about processes for living together with toil
    ● SRE practices for Microservices
    ○ Design self-service ways of improving reliability


  51. Conclusion

    The most important part of SRE practices is how you implement the workflows in your team or organization.
