SRE Practices in Mercari Microservices

taichi nakashima
January 25, 2020

This is a slide for SRE NEXT 2020 (https://sre-next.dev/).

The Mercari Microservices Platform Team builds a platform on which backend developers build and run microservices. Currently, more than 100 microservices run on this platform, and more than 200 developers work with it. At this scale, the reliability of the platform is critical. In this talk, I share how we operate the platform by applying Google's SRE practices: how we set and "update" SLI/SLO, how we ensure observability, how we prepare playbooks, and so on.

The referenced books, talks, and links are collected at https://gist.github.com/tcnksm/cc7ce8d7edc5b31a4710633574664c61


Transcript

  1. 4.

    Platform Team and SRE Team in Microservices

    (Diagram: Mercari SRE and Merpay SRE work closely with, or are embedded in, the Service A, B, and C teams; the Platform Team builds and runs the platform.)
  2. 6.

    Reliabilities in Microservices

    (Diagram: Mercari SRE and Merpay SRE, working closely with or embedded in the service teams, own reliability for the microservices; the Platform Team owns reliability for the platform.)
  3. 8.

    • SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    • On-call: How do we design on-call for microservices? How do we design alerting?
    • Toil: How do we handle toil as a team? How do we evolve on-support?
  4. 10.

    Prerequisite of SLI/SLO: Project Management

    (Diagram: each sprint team runs sprints with sprint backlogs and sprint reviews; SLOs feed the priority discussion that shapes the backlogs.)
  5. 12.

    SLO Document (Template)
    • Header
      ◦ Author(s) & Approver(s)
      ◦ Approval date & Revisit date
    • Service Overview
    • SLI and SLO
      ◦ Category (availability, latency, quality, freshness, correctness, durability)
      ◦ Time window (daily, weekly, monthly, quarterly)
      ◦ SLI specification (what) & implementation (how)
      ◦ SLO
    • Rationale

    https://landing.google.com/sre/workbook/chapters/slo-document/
  7. 17.

    SLO for Spinnaker: Pipeline execution success rate
    • Category: Availability
    • Time window: Weekly
    • SLI specification: The ratio of sample pipeline executions that finish successfully
    • SLI implementation: The number of successful pipeline executions / total pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration provided by spinnaker-datadog-bridge
    • SLO: 99.5% of pipeline executions finish successfully.
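The availability SLI above is a simple ratio; a minimal sketch of the check, with made-up counts in place of the real Datadog data:

```python
# Availability SLI for a CD pipeline: successful executions / total
# executions, checked against a 99.5% SLO. Counts are hypothetical.
def availability_sli(successes: int, total: int) -> float:
    return successes / total

SLO = 0.995
successes, total = 1991, 2000
sli = availability_sli(successes, total)
print(f"SLI={sli:.4f}, SLO met: {sli >= SLO}")
```

With 1991 successes out of 2000 executions, the SLI is 0.9955, just above the 99.5% target.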
  8. 18.

    SLO for Spinnaker: Pipeline execution duration
    • Category: Latency
    • Time window: Weekly
    • SLI specification: Pipeline executions that finish successfully do so within 5 minutes
    • SLI implementation: The duration of successful pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration provided by spinnaker-datadog-bridge
    • SLO: 99.5% of pipeline executions finish successfully within 5 minutes.
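The latency SLO above can be checked the same way; a sketch that counts executions finishing within the 5-minute threshold (durations are fabricated, not real pipeline data):

```python
# Latency SLI: share of successful pipeline executions that finish
# within 5 minutes, against a 99.5% SLO. Durations are in seconds.
THRESHOLD_S = 5 * 60
LATENCY_SLO = 0.995

def latency_sli(durations_s: list[float]) -> float:
    within = sum(1 for d in durations_s if d <= THRESHOLD_S)
    return within / len(durations_s)

durations = [120, 180, 240, 400, 290, 310]  # hypothetical sample
sli = latency_sli(durations)
print(f"SLI={sli:.3f}, SLO met: {sli >= LATENCY_SLO}")
```

In this sample, two of the six executions exceed the threshold, so the SLI is well below the target.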
  9. 20.

    How We Update SLI/SLO
    SLO revisit meeting agenda:
    • Check whether we meet the SLO
    • Review incident reports and postmortems
    • Review support issues
    • Check customer satisfaction
      ◦ Ego-searching on Slack
      ◦ Check developer survey results
  10. 23.

    Examples of SLI/SLO Updates
    • Case 1
      ◦ SLOs: met; Toil: high; Customer satisfaction: low
      ◦ Action: tighten the SLO
    • Case 2
      ◦ SLOs: met; Toil: low; Customer satisfaction: low
      ◦ Action: change the SLI instead of tightening the SLO (*)

    (*) From the developer survey and ego-searching, we noticed the current SLI did not reflect customer usage.
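The two cases above amount to a small decision rule; a sketch that encodes just those cases (anything outside them is left to discussion):

```python
# Decision sketch for the SLO revisit meeting, covering only the two
# cases from the slide; other combinations are left undecided here.
def revisit_action(slo_met: bool, toil_high: bool, satisfaction_low: bool) -> str:
    if slo_met and toil_high and satisfaction_low:
        return "tighten the SLO"
    if slo_met and not toil_high and satisfaction_low:
        # Survey and ego-search showed the SLI missed real customer usage.
        return "change the SLI"
    return "discuss at the revisit meeting"

print(revisit_action(slo_met=True, toil_high=True, satisfaction_low=True))
print(revisit_action(slo_met=True, toil_high=False, satisfaction_low=True))
```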
  11. 24.

    Example of SLI/SLO Update: Changing the Spinnaker SLI

    (Diagram: before, the old SLI covered pipeline start to pipeline finish; after, the new SLI also accounts for the image push step.)
  12. 25.

    How We Use SLO

    (Diagram: while the SLO is OK, normal releases continue. On an SLO violation, the on-support team handles it if it is capable, so the other members can focus on their tasks; if it is hard to handle, it escalates to the sprint team.)

  13. 26.

    • SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    • On-call: How do we design on-call for microservices? How do we design alerting?
    • Toil: How do we handle toil as a team? How do we evolve on-support?
  14. 27.

    Who Is On-call for a Microservice?

    (Diagram: each service team is on-call for its own service, and the Platform Team is on-call for the platform; the boundary between them defines on-call responsibility.)
  15. 29.

    Why Is End-to-End Responsibility Important?

    (Diagram: a Develop, Fail, Analysis, Learn loop; a boundary between the dev team and the ops team breaks this loop.)
    https://youtu.be/KLxwhsJuZ44
  16. 30.

    On-call Responsibility for Google Kubernetes Engine

    (Diagram: the Google SRE team is on-call for the Kubernetes master; the Platform Team for the Kubernetes nodes and the system namespaces; each service team for the pods in its own service namespace.)
  17. 31.

    Principles of Alerting
    • Alert by RED and investigate by USE
    • Alerts must be actionable

    Design alerts assuming you will be woken up by them at 3 a.m. in the middle of winter vacation.
  18. 32.

    RED (Rate, Errors, Duration): top-level health of the system, i.e., whether it is working from the customer's point of view.
    USE (Utilization, Saturation, Errors): low-level health of the system, i.e., the status of system resources.
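As a concrete illustration, the RED signals can be computed from a request log; the field names and numbers below are assumptions for the sketch, not from the talk:

```python
# Computing RED (Rate, Errors, Duration) from a toy request log:
# each entry is (HTTP status, duration in ms) over a 60-second window.
import statistics

requests = [(200, 35), (200, 42), (500, 120), (200, 38), (404, 20)]
window_s = 60

rate = len(requests) / window_s                               # req/s
error_ratio = sum(1 for c, _ in requests if c >= 500) / len(requests)
p50_ms = statistics.median(d for _, d in requests)            # duration

print(f"rate={rate:.3f} req/s, errors={error_ratio:.0%}, p50={p50_ms} ms")
```

USE signals (CPU utilization, queue saturation, resource errors) would come from host- or node-level monitoring instead, and are the place to look when a RED alert fires.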
  19. 33.

    Actionable Alerts
    • Prepare a playbook for each alert
    • Review the alert and the playbook with team members
  20. 34.

    Playbook (Template)
    • Links
      ◦ SLO document
      ◦ Datadog timeboard & screenboard
      ◦ GitHub repository
    • Alerts
      ◦ Links to the Datadog monitor
      ◦ What (what does the alert indicate?)
      ◦ Why (why is this alert required?)
      ◦ Investigation
      ◦ Mitigation
  21. 35.

    Playbook (Vault)
    Alert: Vault pods are sealed!
    • What: This alert detects that Vault pods are sealed unexpectedly
    • Why: An unexpected Vault seal means something is wrong
    • Investigation: (kubectl and vault commands for investigation)
    • Mitigation: (kubectl and vault commands for mitigation)
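The alert condition in this playbook can also be verified programmatically: Vault reports its seal state at GET /v1/sys/seal-status as JSON with a "sealed" field. A sketch of the check (the sample response body is illustrative, not captured from a real cluster):

```python
# Detects the "Vault pods are sealed" condition from the JSON body of
# Vault's /v1/sys/seal-status endpoint ("sealed" is a boolean field).
import json

def is_sealed(seal_status_body: str) -> bool:
    return bool(json.loads(seal_status_body)["sealed"])

# Trimmed, illustrative response body:
sample = '{"type": "shamir", "sealed": true, "t": 3, "n": 5}'
print("ALERT: Vault is sealed" if is_sealed(sample) else "Vault is unsealed")
```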
  22. 36.

    • SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    • On-call: How do we design on-call for microservices? How do we design alerting?
    • Toil: How do we handle toil as a team? How do we evolve on-support?
  23. 37.

    Reactive Tasks
    • Toil (e.g., manual scripts for maintaining platform components)
    • Developer support (e.g., answering questions)
    • Bug fixes (e.g., fixing CI automation scripts)
    • Security fixes (e.g., upgrading Kubernetes clusters)
  24. 38.

    Principles of Reactive Tasks
    We cannot completely remove reactive tasks; they will always be there as you grow. Instead, think about how to live with them better.
  25. 39.

    "At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features."
    https://landing.google.com/sre/sre-book/chapters/eliminating-toil/
  26. 40.

    How We Evolve On-support

    (Diagram: on-support evolved from an individual within each sprint team, to a rotation across the sprint teams, to a dedicated on-support team with its own rotation.)
  27. 41.

    • SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    • On-call: How do we design on-call for microservices? How do we design alerting?
    • Toil: How do we handle toil as a team? How do we evolve on-support?
  28. 43.

    How We Ensure Microservices Reliability

    (Diagram: 150+ microservices run on the platform; Mercari SRE and Merpay SRE work closely with or are embedded in the service teams, and the Platform Team runs the platform.)
  29. 46.

    Microservices Design Doc (Template)
    • Summary
    • Background
      ◦ Goals
      ◦ Non-goals
    • System Design
      ◦ Interfaces
      ◦ Traffic migration
      ◦ Expected clients & dependencies
      ◦ SLI/SLO
      ◦ Databases
      ◦ Security considerations
  30. 47.

    Production Readiness Check
    • Maintainability (e.g., unit tests, test coverage, automated builds, ...)
    • Observability (e.g., Datadog timeboard & screenboard, profiling, logging)
    • Reliability (e.g., autoscaling, graceful shutdown, PDB, ...)
    • Security (e.g., security team review, non-sensitive logs, ...)
    • Accessibility (e.g., design doc, API doc, ...)
    • Data Storage (e.g., data replication, backup and recovery, ...)
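Among the reliability items, "PDB" refers to a Kubernetes PodDisruptionBudget. A minimal sketch of one (the service name, namespace, and replica count are placeholders, not Mercari's actual configuration):

```yaml
# Keep at least 2 pods of the (hypothetical) service-a available
# during voluntary disruptions such as node drains or upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: service-a-pdb
  namespace: service-a
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: service-a
```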
  31. 50.

    Key Takeaways
    • SRE practices for the platform
      ◦ Don't try to have perfect SLI/SLOs; continuously improve them
      ◦ Design alerts assuming you will be woken up by them at 3 a.m.
      ◦ Think about a process for living with toil
    • SRE practices for microservices
      ◦ Design a self-service way of improving reliability
  32. 51.

    Conclusion
    The most important part of SRE practice is how you implement the workflows in your own team or organization.