Slide 1

Slide 1 text

SRE Practices in Mercari Microservices
SRE NEXT 2020

Slide 2

Slide 2 text

Taichi Nakashima @deeeet / @tcnksm

Slide 3

Slide 3 text

https://blog.eventuate.io/2017/01/04/the-microservice-architecture-is-a-means-to-an-end-enabling-continuous-deliverydeployment/

Slide 4

Slide 4 text

Platform Team and SRE Team in Microservices
[Diagram: the Service A, Service B, and Service C Teams each own their service. Mercari SRE and Merpay SRE work closely with, or are embedded in, the service teams. The Platform Team owns the Platform.]

Slide 5

Slide 5 text

150+ Microservices
3000+ Kubernetes Pods
10 Platformers

Slide 6

Slide 6 text

Reliabilities in Microservices
[Diagram: the Service A, Service B, and Service C Teams each own their service, with Mercari SRE and Merpay SRE working closely with, or embedded in, them. The Platform Team owns the Platform. Two areas are marked: reliabilities for microservices and reliabilities for platform.]

Slide 7

Slide 7 text

SRE Practices for Platform

Slide 8

Slide 8 text

SLI/SLO
 ● How do we set SLI/SLO?
 ● How do we update them and use them for project decisions?
On-call
 ● How do we design on-call for microservices?
 ● How do we design alerting?
Toil
 ● How do we handle toil as a team?
 ● How do we evolve on-support?

Slide 9

Slide 9 text

Agility vs. Stability
How do you incentivize reliability?
https://cre.page.link/art-of-slos-slides

Slide 10

Slide 10 text

Prerequisite of SLI/SLO: Project Management
[Diagram labels: Sprint team, Sprint backlog, Sprint, Sprint Review, Priority discussion, SLO, Backlogs]

Slide 11

Slide 11 text

Principles of SLI/SLO
 ● Focus on customers (= internal developers)
 ● Don't try to be perfect from the beginning

Slide 12

Slide 12 text

SLO Document (Template)
 ● Header
   ○ Author(s) & Approver(s)
   ○ Approval date & Revisit date
 ● Service Overview
 ● SLI and SLO
   ○ Category (availability, latency, quality, freshness, correctness, durability)
   ○ Time window (daily, weekly, monthly, quarterly)
   ○ SLI specification (what) & implementation (how)
   ○ SLO
 ● Rationale
https://landing.google.com/sre/workbook/chapters/slo-document/

Slide 13

Slide 13 text

SLI/SLO for Spinnaker
 ● How do we define SLI/SLO?
 ● How do we update SLI/SLO?

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Spinnaker pipeline

Slide 16

Slide 16 text

SLO for Spinnaker
 ● Pipeline execution success rate
 ● Pipeline execution duration

Slide 17

Slide 17 text

SLO for Spinnaker
 Pipeline execution success rate
 ● Category: Availability
 ● Time window: Weekly
 ● SLI specification: The ratio of sample pipeline executions that finished successfully
 ● SLI implementation: The number of successful pipeline executions / total pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
 ● SLO: 99.5% of the pipeline executions finish successfully.
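The success-rate SLI on this slide can be sketched in a few lines. This is a minimal illustration, not the team's actual Datadog query: the function names and the sample counts are hypothetical, and only the 99.5% target comes from the slide.

```python
def success_rate(successful: int, total: int) -> float:
    """Ratio of successful pipeline executions to all executions."""
    if total == 0:
        # Assumption: a window with no executions counts as fully met.
        return 1.0
    return successful / total


def meets_slo(successful: int, total: int, target: float = 0.995) -> bool:
    """True when the success rate for the window is at or above the target."""
    return success_rate(successful, total) >= target


# Hypothetical weekly counts: 1,995 successes out of 2,000 executions.
weekly_ok = meets_slo(successful=1995, total=2000)
```

With the hypothetical counts above the ratio is 0.9975, so the weekly SLO would be met; 1,980 successes out of 2,000 (0.99) would violate it.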

Slide 18

Slide 18 text

SLO for Spinnaker
 Pipeline execution duration
 ● Category: Latency
 ● Time window: Weekly
 ● SLI specification: The average duration of pipeline executions that finished successfully is less than 5 minutes.
 ● SLI implementation: The average duration of pipeline executions that finished successfully is less than 5 minutes, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
 ● SLO: 99.5% of the pipeline executions finish successfully within 5 minutes.
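The slide words the SLI as an average but the SLO as a proportion of executions. As a rough sketch of the proportion form only, the share of executions that finish successfully within 5 minutes could be computed as below. The `Execution` record and the sample values are hypothetical and are not taken from spinnaker-datadog-bridge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Execution:
    succeeded: bool
    duration_seconds: float


def fraction_within(executions: List[Execution], limit_seconds: float = 300.0) -> float:
    """Share of executions that succeeded and finished within the limit."""
    if not executions:
        # Assumption: an empty window counts as fully met.
        return 1.0
    good = sum(
        1 for e in executions
        if e.succeeded and e.duration_seconds <= limit_seconds
    )
    return good / len(executions)


# Hypothetical sample: two of the four executions are "good".
sample = [
    Execution(True, 120.0),
    Execution(True, 400.0),   # succeeded, but too slow
    Execution(False, 60.0),   # fast, but failed
    Execution(True, 299.0),
]
sample_fraction = fraction_within(sample)
```

The result would then be compared against the 0.995 target from the slide.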

Slide 19

Slide 19 text

Spinnaker SLO Monitoring by Datadog

Slide 20

Slide 20 text

How We Update SLI/SLO?
 SLO revisit meeting agenda
 ● Check whether we meet SLO
 ● Review incident reports and postmortems
 ● Review support issues
 ● Check customer satisfaction
   ○ Ego-searching on Slack
   ○ Check developer survey results

Slide 21

Slide 21 text

How We Update SLI/SLO: Developer Survey

Slide 22

Slide 22 text

How We Update SLI/SLO: SLO Decision Matrix
https://landing.google.com/sre/workbook/chapters/implementing-slos/

Slide 23

Slide 23 text

Examples of SLI/SLO Update
 ● Case 1
   ○ SLOs: Met, Toil: High, Customer satisfaction: Low
   ○ Tighten the SLO
 ● Case 2
   ○ SLOs: Met, Toil: Low, Customer satisfaction: Low
   ○ Change the SLI instead of tightening the SLO (*)
 * From the developer survey and ego-searching, we noticed that the current SLI does not reflect customer usage
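The two cases on this slide can be written as a small lookup. This sketch encodes only those two cases and returns `None` for every other combination, because the rest of the decision matrix is not given here. The function name and the string values are hypothetical.

```python
from typing import Optional


def slo_action(slo_met: bool, toil: str, satisfaction: str) -> Optional[str]:
    """Return the action for the two cases on the slide, else None."""
    if slo_met and satisfaction == "low":
        if toil == "high":
            return "tighten SLO"   # Case 1
        if toil == "low":
            return "change SLI"    # Case 2
    # Other combinations are not covered by the slide.
    return None


case1 = slo_action(slo_met=True, toil="high", satisfaction="low")
case2 = slo_action(slo_met=True, toil="low", satisfaction="low")
```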

Slide 24

Slide 24 text

Example of SLI/SLO Update: Changing Spinnaker SLI
[Diagram labels] Before: Pipeline start, Pipeline finish (Old SLI). After: Image push, Pipeline start, Pipeline finish (New SLI).

Slide 25

Slide 25 text

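How We Use SLO?
[Diagram: if the SLO is OK, continue normal release. If the SLO is violated and the on-support team is capable of handling it, on-support handles it while other members focus on their tasks. If it is hard to handle, the sprint team handles it.]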
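One way to read the flow on this slide is as a small routing rule. The sketch below reflects an interpretation of the diagram labels, not a documented process: the function name, the argument names, and the returned strings are hypothetical.

```python
def route_work(slo_ok: bool, on_support_capable: bool) -> str:
    """Decide who acts, based on SLO status and on-support capacity."""
    if slo_ok:
        return "continue normal release"
    if on_support_capable:
        # Other members keep focusing on their own tasks.
        return "on-support handles"
    # Assumption: "hard to handle" escalates to the whole sprint team.
    return "sprint team handles"


decision = route_work(slo_ok=False, on_support_capable=True)
```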


Slide 26

Slide 26 text

SLI/SLO
 ● How do we set SLI/SLO?
 ● How do we update them and use them for project decisions?
On-call
 ● How do we design on-call for microservices?
 ● How do we design alerting?
Toil
 ● How do we handle toil as a team?
 ● How do we evolve on-support?

Slide 27

Slide 27 text

Who Is On-call for a Microservice?
[Diagram: the Service A, Service B, and Service C Teams are each on-call for their own service, and the Platform Team is on-call for the Platform. A boundary separates the services from the Platform.]

Slide 28

Slide 28 text

Why End-to-End Responsibility is Important?
[Diagram: a cycle of Develop, Fail, Analysis, Learn]
https://youtu.be/KLxwhsJuZ44

Slide 29

Slide 29 text

Why End-to-End Responsibility is Important?
[Diagram: the same cycle of Develop, Fail, Analysis, Learn, with a boundary between the Dev team and the Ops team]
https://youtu.be/KLxwhsJuZ44

Slide 30

Slide 30 text

On-call Responsibility for Google Kubernetes Engine
[Diagram: the Service A, Service B, and Service C Teams are responsible for the pods in service namespaces A, B, and C. The Platform Team is responsible for the pods in system namespaces X, Y, and Z. The Google SRE team is responsible for the Kubernetes Master. The diagram also shows the Kubernetes Nodes and two boundaries between the areas of responsibility.]

Slide 31

Slide 31 text

Principles of Alerting
 ● Alert by RED and investigate by USE
 ● Alerts must be actionable
 Design each alert as one you would accept being woken up for at 3 a.m. in the middle of winter vacation

Slide 32

Slide 32 text

RED: Rate, Error, Duration
 ● Top-level health of the system
 ● Whether the system is working from the customer's point of view
USE: Utilization, Saturation, Error
 ● Low-level health of the system
 ● System resource status
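As a small illustration of the RED side, the three signals can be derived from a list of request records. This is a sketch under stated assumptions: the `Request` record, the field names, and the sample values are hypothetical, and a real setup would read these from a monitoring system rather than compute them in process.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    ok: bool
    duration_ms: float


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Error ratio, and mean Duration for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "mean_duration_ms": (
            sum(r.duration_ms for r in requests) / total if total else 0.0
        ),
    }


# Hypothetical two-second window with one success and one failure.
summary = red_summary([Request(True, 100.0), Request(False, 300.0)], window_seconds=2.0)
```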

Slide 33

Slide 33 text

Actionable Alert
 ● Prepare a playbook for each alert
 ● Have team members review the alert and the playbook

Slide 34

Slide 34 text

Playbook (Template)
 ● Links
   ○ SLO document
   ○ Datadog timeboard & screenboard
   ○ GitHub repository
 ● Alerts
   ○ Links to Datadog monitor
   ○ What (What does the alert indicate?)
   ○ Why (Why is this alert required?)
   ○ Investigation
   ○ Mitigation
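The alert section of the template maps naturally onto a small record, which makes it easy to check that no field was left empty before an alert goes live. This is only a sketch: the `AlertEntry` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertEntry:
    monitor_link: str
    what: str
    why: str
    investigation: str
    mitigation: str

    def missing_fields(self) -> List[str]:
        """Names of fields that are empty or whitespace only."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Hypothetical draft entry with the mitigation steps still missing.
draft = AlertEntry(
    monitor_link="https://example.com/monitor/1",  # placeholder URL
    what="Pods are sealed unexpectedly",
    why="An unexpected seal means something has gone wrong",
    investigation="Commands to inspect pod and seal status",
    mitigation="",
)
missing = draft.missing_fields()
```

An entry with a non-empty `missing` list would not yet count as actionable in the sense of the previous slide.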

Slide 35

Slide 35 text

Playbook (Vault)
 Vault pods are sealed!
 ● What: This alert detects that Vault pods are sealed unexpectedly
 ● Why: An unexpected Vault seal means something has gone wrong
 ● Investigation: (kubectl and vault commands for investigation)
 ● Mitigation: (kubectl and vault commands for mitigation)

Slide 36

Slide 36 text

SLI/SLO
 ● How do we set SLI/SLO?
 ● How do we update them and use them for project decisions?
On-call
 ● How do we design on-call for microservices?
 ● How do we design alerting?
Toil
 ● How do we handle toil as a team?
 ● How do we evolve on-support?

Slide 37

Slide 37 text

Reactive Tasks
 ● Toil (e.g., running manual scripts to maintain platform components)
 ● Developer support (e.g., answering questions)
 ● Bug fixes (e.g., fixing CI automation scripts)
 ● Security fixes (e.g., upgrading Kubernetes clusters)

Slide 38

Slide 38 text

Principles of Reactive Tasks
 We can NOT completely remove reactive tasks. They will always be there as long as you grow. Instead, think about how to live with them better.

Slide 39

Slide 39 text

"At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features." https://landing.google.com/sre/sre-book/chapters/eliminating-toil/

Slide 40

Slide 40 text

How We Evolve On-support?
[Diagram: three stages labelled Individual, Rotation, and Dedicated team rotation, showing sprint teams, on-support, and an on-support team]

Slide 41

Slide 41 text

SLI/SLO
 ● How do we set SLI/SLO?
 ● How do we update them and use them for project decisions?
On-call
 ● How do we design on-call for microservices?
 ● How do we design alerting?
Toil
 ● How do we handle toil as a team?
 ● How do we evolve on-support?

Slide 42

Slide 42 text

SRE Practices for Microservices

Slide 43

Slide 43 text

How Do We Ensure Microservices Reliability?
[Diagram: the Service A, Service B, and Service C Teams each own their service, with Mercari SRE and Merpay SRE working closely with, or embedded in, them. The Platform Team owns the Platform. Label: 150+ microservices.]

Slide 44

Slide 44 text

Principles of Microservices Reliability
 Bring platform team practices to development team and implement self-service capabilities

Slide 45

Slide 45 text

Design Doc Production Readiness Check

Slide 46

Slide 46 text

Microservices Design Doc (Template)
 ● Summary
 ● Background
   ○ Goals
   ○ Non-goals
 ● System Design
   ○ Interfaces
   ○ Traffic migration
   ○ Expected clients & dependencies
   ○ SLI/SLO
   ○ Databases
   ○ Security considerations

Slide 47

Slide 47 text

Production Readiness Check
 ● Maintainability (e.g., Unit test, Test coverage, Automated build, ...)
 ● Observability (e.g., Datadog timeboard & screenboard, Profiling, Logging)
 ● Reliability (e.g., Auto scale, Graceful shutdown, PDB, ...)
 ● Security (e.g., Security team review, Non-sensitive logs, ...)
 ● Accessibility (e.g., Design doc, API doc, ...)
 ● Data Storage (e.g., Data replication, Backup and recovery, ...)
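A checklist like this lends itself to a self-service check that development teams can run on their own. The sketch below assumes a hypothetical result format (category name mapped to named pass/fail checks). Only the six category names come from the slide; the individual check names are illustrative.

```python
from typing import Dict, List

CATEGORIES = (
    "Maintainability",
    "Observability",
    "Reliability",
    "Security",
    "Accessibility",
    "Data Storage",
)


def unmet_categories(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Categories with no recorded checks or with at least one failing check."""
    unmet = []
    for category in CATEGORIES:
        checks = results.get(category, {})
        if not checks or not all(checks.values()):
            unmet.append(category)
    return unmet


# Hypothetical results for a service that is only partly ready.
results = {
    "Maintainability": {"unit_test": True, "automated_build": True},
    "Observability": {"dashboards": True, "logging": False},
    "Reliability": {"graceful_shutdown": True},
}
not_ready = unmet_categories(results)
```

Here the service would still need work on Observability plus the three categories with no recorded checks.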

Slide 48

Slide 48 text

Production Readiness Check

Slide 49

Slide 49 text

Design Doc Production Readiness Check

Slide 50

Slide 50 text

Key Takeaways
 ● SRE practices for Platform
   ○ Don't try to have a perfect SLI/SLO; improve it continuously
   ○ Design alerts that are worth being woken up for at 3 a.m.
   ○ Think about a process for living with toil
 ● SRE practices for Microservices
   ○ Design a self-service way of improving reliability

Slide 51

Slide 51 text

Conclusion
 The most important part of SRE practices is how you implement the workflows in your team or organization