
SRE Practices in Mercari Microservices

Taichi Nakashima
January 25, 2020

SRE Practices in Mercari Microservices

These are the slides for SRE NEXT 2020 (https://sre-next.dev/).

The Mercari Microservices Platform Team builds a platform on which backend developers build and run microservices. Currently, around 100+ microservices run on this platform and more than 200 developers work with it. At this scale, reliability is critical. In this talk, I share how we operate the platform by applying Google's SRE practices: how we set and "update" SLI/SLO, how we ensure observability, how we prepare playbooks, and so on.

The referenced books, talks, and links are at https://gist.github.com/tcnksm/cc7ce8d7edc5b31a4710633574664c61


Transcript

  1. SRE Practices in Mercari Microservices
    SRE NEXT 2020


  2. Taichi Nakashima
    @deeeet / @tcnksm


  3. https://blog.eventuate.io/2017/01/04/the-microservice-architecture-is-a-means-to-an-end-enabling-continuous-deliverydeployment/


  4. Platform Team and SRE Team in Microservices
    (Diagram: the Platform Team owns the Platform; the Service A, B, and C Teams own Services A, B, and C; Mercari SRE and Merpay SRE work closely with, or are embedded in, the service teams.)


  5. 150+
    Microservices
    3000+
    Kubernetes Pods
    10
    Platformers


  6. Reliability in Microservices
    (Diagram: the Platform Team is responsible for the reliability of the platform; the Service A, B, and C Teams, with Mercari SRE and Merpay SRE working closely or embedded, are responsible for the reliability of their microservices.)


  7. SRE Practices for Platform


  8. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  9. Agility vs. Stability
    How do you incentivize reliability?
    https://cre.page.link/art-of-slos-slides


  10. Prerequisite of SLI/SLO: Project Management
    (Diagram: each sprint team runs sprints and sprint reviews from its sprint backlog; the SLO and the backlogs feed into the priority discussion.)


  11. Principles of SLI/SLO

    ● Focus on customers (= internal developers)
    ● Don’t try to be perfect from the beginning


  12. SLO Document (Template)

    ● Header
    ○ Author(s) & Approver(s)
    ○ Approval date & Revisit date
    ● Service Overview
    ● SLI and SLO
    ○ Category (availability, latency, quality, freshness, correctness, durability)
    ○ Time window (daily, weekly, monthly, quarterly)
    ○ SLI specification (what) & implementation (how)
    ○ SLO
    ● Rationale
    https://landing.google.com/sre/workbook/chapters/slo-document/
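
    The bullet points above can also be read as a small data model. The following Go sketch is purely illustrative and hypothetical (the deck shows no code, and these field names are not Mercari's); it only mirrors the template fields.

    // Hypothetical sketch of the SLO document template above as a Go struct.
    // Field names are illustrative; the actual SLO document is prose, not code.
    package slodoc

    import "time"

    type Objective struct {
        Category          string // availability, latency, quality, freshness, correctness, durability
        TimeWindow        string // daily, weekly, monthly, quarterly
        SLISpecification  string // what is measured
        SLIImplementation string // how it is measured (e.g., a Datadog metric)
        SLO               string // the target, e.g., "99.5% of executions finish successfully"
    }

    type SLODocument struct {
        // Header
        Authors      []string
        Approvers    []string
        ApprovalDate time.Time
        RevisitDate  time.Time

        // Service Overview
        ServiceOverview string

        // SLI and SLO
        Objectives []Objective

        // Rationale
        Rationale string
    }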


  13. SLI/SLO for Spinnaker

    ● How do we define SLI/SLO?
    ● How do we update SLI/SLO?



  15. Spinnaker pipeline


  16. SLO for Spinnaker

    ● Pipeline execution success rate
    ● Pipeline execution duration


  17. SLO for Spinnaker

    Pipeline execution success rate
    ● Category: Availability
    ● Time window: Weekly
    ● SLI specification: The ratio of sample pipeline executions that finished successfully
    ● SLI implementation: The number of successful pipeline executions / total pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
    ● SLO: 99.5% of the pipeline executions finish successfully.


  18. SLO for Spinnaker

    Pipeline execution duration
    ● Category: Latency
    ● Time window: Weekly
    ● SLI specification: The duration of a pipeline execution that finished successfully is less than 5 minutes
    ● SLI implementation: The duration of successful pipeline executions, as measured from the Datadog metric spinnaker.pipeline.duration, which is provided by spinnaker-datadog-bridge
    ● SLO: 99.5% of the pipeline executions finish successfully within 5 minutes.
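
    For illustration only (this is not Mercari's implementation; the slides say these numbers come from the Datadog metric spinnaker.pipeline.duration), here is a minimal Go sketch of how the two SLIs above could be computed from a week of pipeline execution records.

    // Assumed sketch: computing the two Spinnaker SLIs from raw execution records.
    package main

    import (
        "fmt"
        "time"
    )

    type Execution struct {
        Succeeded bool
        Duration  time.Duration
    }

    // successRate: the "pipeline execution success rate" SLI
    // (successful executions / total executions).
    func successRate(execs []Execution) float64 {
        if len(execs) == 0 {
            return 1.0
        }
        ok := 0
        for _, e := range execs {
            if e.Succeeded {
                ok++
            }
        }
        return float64(ok) / float64(len(execs))
    }

    // withinTarget: the "pipeline execution duration" SLI
    // (executions finishing successfully within the target / total executions).
    func withinTarget(execs []Execution, target time.Duration) float64 {
        if len(execs) == 0 {
            return 1.0
        }
        ok := 0
        for _, e := range execs {
            if e.Succeeded && e.Duration <= target {
                ok++
            }
        }
        return float64(ok) / float64(len(execs))
    }

    func main() {
        week := []Execution{
            {Succeeded: true, Duration: 3 * time.Minute},
            {Succeeded: true, Duration: 7 * time.Minute},
            {Succeeded: false, Duration: 1 * time.Minute},
        }
        fmt.Printf("availability SLI: %.3f (SLO: 0.995)\n", successRate(week))
        fmt.Printf("latency SLI:      %.3f (SLO: 0.995)\n", withinTarget(week, 5*time.Minute))
    }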


  19. Spinnaker SLO Monitoring by Datadog


  20. How We Update SLI/SLO?

    SLO revisit meeting agenda
    ● Check whether we meet SLO
    ● Review incident reports and postmortems
    ● Review support issues
    ● Check customer satisfaction
    ○ Ego-searching on Slack
    ○ Check developer survey results


  21. How We Update SLI/SLO: Developer Survey


  22. How We Update SLI/SLO: SLO Decision Matrix
    https://landing.google.com/sre/workbook/chapters/implementing-slos/


  23. Examples of SLI/SLO Update

    ● Case 1
    ○ SLOs: Met, Toil: High, Customer satisfaction: Low
    ○ Tighten the SLO
    ● Case 2
    ○ SLOs: Met, Toil: Low, Customer satisfaction: Low
    ○ Change the SLI instead of tightening the SLO (*)
    * From the developer survey and ego-searching, we noticed the current SLI did not reflect customer usage
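
    The two cases above can be read as rows of the decision matrix. As a hypothetical Go sketch (the real decision is a team discussion guided by the SRE Workbook's matrix, not code):

    // Illustrative only: the two example cases above as a tiny decision helper.
    package main

    import "fmt"

    type Signals struct {
        SLOMet         bool
        ToilHigh       bool
        CustomersHappy bool
    }

    func nextAction(s Signals) string {
        switch {
        case s.SLOMet && s.ToilHigh && !s.CustomersHappy:
            return "tighten the SLO" // Case 1
        case s.SLOMet && !s.ToilHigh && !s.CustomersHappy:
            return "change the SLI instead of tightening the SLO" // Case 2
        default:
            return "consult the SLO decision matrix"
        }
    }

    func main() {
        fmt.Println(nextAction(Signals{SLOMet: true, ToilHigh: true}))  // Case 1
        fmt.Println(nextAction(Signals{SLOMet: true, ToilHigh: false})) // Case 2
    }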


  24. Example of SLI/SLO Update: Changing the Spinnaker SLI
    (Diagram: Before, the old SLI was measured from pipeline start to pipeline finish. After, the new SLI is measured from image push to pipeline finish.)


  25. How We Use SLO?
    (Diagram: while the SLO is OK, the sprint team continues normal releases. When the SLO is violated and the on-support team is capable, the on-support team handles it while the other members focus on their tasks; when it is hard to handle, the sprint team takes over.)



  26. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  27. Who Is On-call for a Microservice?
    (Diagram: each service team is on-call for its own service (Service A, B, C); the Platform Team is on-call for the platform. A boundary separates the services from the platform.)


  28. Why Is End-to-End Responsibility Important?
    (Diagram: a feedback loop of Develop → Fail → Analysis → Learn.)
    https://youtu.be/KLxwhsJuZ44


  29. Why Is End-to-End Responsibility Important?
    (Diagram: the same Develop → Fail → Analysis → Learn loop, with a boundary between the Dev team and the Ops team cutting across it.)
    https://youtu.be/KLxwhsJuZ44


  30. On-call Responsibility for Google Kubernetes Engine
    (Diagram: the Google SRE team is responsible for the Kubernetes master; the Platform Team is responsible for the Kubernetes nodes and the system namespaces X, Y, and Z; each service team is responsible for the pods in its own service namespace (A, B, C). Boundaries separate these layers of responsibility.)

  31. Principles of Alerting

    ● Alert by RED and investigate by USE
    ● Alerts must be actionable
    Design alerts assuming they will wake you up at 3 a.m. in the middle of your winter vacation


  32. RED vs. USE
    ● RED (Rate, Errors, Duration): top-level health of the system — whether the system is working from the customer's point of view
    ● USE (Utilization, Saturation, Errors): low-level health of the system — system resource status
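
    To make "alert by RED" concrete, here is a minimal, dependency-free Go sketch of RED instrumentation for an HTTP service. It is an illustration under assumed conditions; in the deck these signals actually come from Datadog, not from hand-rolled counters.

    // Minimal sketch of RED (Rate, Errors, Duration) instrumentation for an HTTP handler.
    package main

    import (
        "log"
        "net/http"
        "sync/atomic"
        "time"
    )

    type redMetrics struct {
        requests   atomic.Int64 // Rate: total requests
        errors     atomic.Int64 // Errors: responses with status >= 500
        durationMs atomic.Int64 // Duration: total handling time in ms, for an average
    }

    // statusRecorder captures the status code written by the wrapped handler.
    type statusRecorder struct {
        http.ResponseWriter
        status int
    }

    func (r *statusRecorder) WriteHeader(code int) {
        r.status = code
        r.ResponseWriter.WriteHeader(code)
    }

    func (m *redMetrics) middleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
            start := time.Now()
            rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
            next.ServeHTTP(rec, req)

            m.requests.Add(1)
            if rec.status >= 500 {
                m.errors.Add(1)
            }
            m.durationMs.Add(time.Since(start).Milliseconds())
        })
    }

    func main() {
        var m redMetrics
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", m.middleware(mux)))
    }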


  33. Actionable Alert

    ● Prepare a playbook for each alert
    ● Have team members review the alert and its playbook


  34. Playbook (Template)

    ● Links
    ○ SLO document
    ○ Datadog timeboard & screenboard
    ○ GitHub repository
    ● Alerts
    ○ Link to the Datadog monitor
    ○ What (What does the alert indicate?)
    ○ Why (Why is this alert required?)
    ○ Investigation
    ○ Mitigation


  35. Playbook (Vault)

    Vault pods are sealed!
    ● What: This alert detects that Vault pods are sealed unexpectedly
    ● Why: An unexpected Vault seal means something is wrong
    ● Investigation: (kubectl and vault commands for investigation)
    ● Mitigation: (kubectl and vault commands for mitigation)


  36. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  37. Reactive Tasks

    ● Toil (e.g., running manual scripts to maintain platform components)
    ● Developer support (e.g., answering questions)
    ● Bug fixes (e.g., fixing CI automation scripts)
    ● Security fixes (e.g., upgrading Kubernetes clusters)


  38. Principles of Reactive Tasks

    We can NOT completely remove reactive tasks; they are always there as you grow. Instead, think about how to live with them better.


  39. "At least 50% of each SRE’s time should be spent on engineering project
    work that will either reduce future toil or add service features."
    https://landing.google.com/sre/sre-book/chapters/eliminating-toil/


  40. How We Evolve On-support?
    (Diagram: on-support evolves from an individual within the sprint teams, to a rotation across the sprint teams, to a dedicated on-support team with its own rotation.)


  41. SLI/SLO / On-call / Toil
    ● SLI/SLO: How do we set SLI/SLO? How do we update them and use them for project decisions?
    ● On-call: How do we design on-call for microservices? How do we design alerting?
    ● Toil: How do we handle toil as a team? How do we evolve on-support?


  42. SRE Practices for Microservices


  43. How We Ensure Microservice Reliability?
    (Diagram: the Platform Team runs the platform for 150+ microservices, which are owned by service teams such as the Service A, B, and C Teams, with Mercari SRE and Merpay SRE working closely or embedded.)


  44. Principles of Microservices Reliability

    Bring the platform team's practices to the development teams and implement them as self-service capabilities


  45. Design Doc / Production Readiness Check


  46. Microservices Design Doc (Template)

    ● Summary
    ● Background
    ○ Goals
    ○ Non-goals
    ● System Design
    ○ Interfaces
    ○ Traffic migration
    ○ Expected clients & dependencies
    ○ SLI/SLO
    ○ Databases
    ○ Security considerations


  47. Production Readiness Check

    ● Maintainability (e.g., unit tests, test coverage, automated build, ...)
    ● Observability (e.g., Datadog timeboard & screenboard, profiling, logging)
    ● Reliability (e.g., autoscaling, graceful shutdown, PDB, ...)
    ● Security (e.g., security team review, non-sensitive logs, ...)
    ● Accessibility (e.g., design doc, API doc, ...)
    ● Data Storage (e.g., data replication, backup and recovery, ...)
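
    As a hedged illustration (the real check is a review process and a document, not code), the categories above could be tracked as a simple checklist. The item names below come from the slide; the structure itself is assumed.

    // Illustrative sketch: tracking production readiness categories as a checklist.
    package main

    import "fmt"

    type Check struct {
        Category string
        Item     string
        Done     bool
    }

    // unmet returns the checks that still block a production launch.
    func unmet(checks []Check) []Check {
        var out []Check
        for _, c := range checks {
            if !c.Done {
                out = append(out, c)
            }
        }
        return out
    }

    func main() {
        checks := []Check{
            {"Maintainability", "Unit tests, test coverage, automated build", true},
            {"Observability", "Datadog timeboard & screenboard, profiling, logging", true},
            {"Reliability", "Autoscaling, graceful shutdown, PDB", false},
            {"Security", "Security team review, non-sensitive logs", false},
            {"Accessibility", "Design doc, API doc", true},
            {"Data Storage", "Data replication, backup and recovery", true},
        }
        for _, c := range unmet(checks) {
            fmt.Printf("NOT READY: [%s] %s\n", c.Category, c.Item)
        }
    }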


  48. Production Readiness Check


  49. Design Doc / Production Readiness Check


  50. Key Takeaways

    ● SRE practices for the Platform
    ○ Don't try to have perfect SLI/SLO; improve them continuously
    ○ Design alerts so that you can be woken up at 3 a.m.
    ○ Think about processes for living together with toil
    ● SRE practices for Microservices
    ○ Design self-service ways of improving reliability


  51. Conclusion

    The most important part of SRE practices is how you implement the workflows in your team or organization.
