Incident Management in Microservices

Serhat Can
October 18, 2019

Microservices require a shift in mindset and tooling. As monitoring shifts toward distributed tracing to meet changing needs, incident response must also adapt. In this presentation, I’ll answer the “why” and “how” of on-call and incident response in microservices.

Transcript

  1. @srhtcn "It is more profitable to focus on speeding recovery than preventing accidents." (From The Site Reliability Workbook)
  2. A good Service Level Objective. Example: 99.9% of HTTP calls will be successfully completed in < 100 ms.
  3. A good Service Level Objective: focuses on user happiness. Example: 99.9% of HTTP calls will be successfully completed in < 100 ms.
  4. A good Service Level Objective: focuses on user happiness; accepts that 100% is the wrong target. Example: 99.9% of HTTP calls will be successfully completed in < 100 ms.
  5. A good Service Level Objective: has an indicator (SLI), ideally a ratio of two numbers; focuses on user happiness; accepts that 100% is the wrong target. Example: 99.9% of HTTP calls will be successfully completed in < 100 ms.
  6. A good Service Level Objective: has an indicator (SLI), ideally a ratio of two numbers; needs continuous improvement; focuses on user happiness; accepts that 100% is the wrong target. Example: 99.9% of HTTP calls will be successfully completed in < 100 ms.
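
The "ratio of two numbers" idea from the slide above can be sketched in a few lines of Python. This is only an illustration: the `Request` type, the function names, and the choice to treat any status below 500 as "successful" are assumptions, not something the deck specifies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record shape)
    latency_ms: float  # time taken to serve the request


def sli(requests, latency_threshold_ms=100.0):
    """Good events divided by total events: the SLI as a ratio of two numbers."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(
        1
        for r in requests
        # assumption: "successful" means a non-5xx status under the threshold
        if r.status < 500 and r.latency_ms < latency_threshold_ms
    )
    return good / len(requests)


def meets_slo(requests, target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli(requests) >= target
```

With 1,000 requests of which one is slow or failed, `sli` returns 0.999 and the 99.9% objective is just met; a second bad request would miss it.
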
  7. Put developers on-call. Increasing demands: maintain high availability, performance and security within more complex systems. Dev - Ops: better alignment of development and operations.
  8. Put developers on-call. Increasing demands: maintain high availability, performance and security within more complex systems. Dev - Ops: better alignment of development and operations. Management - Dev: better alignment of management and development.
  9. Is there an industry standard? Google: SRE model. OpsGenie: ownership model. Amazon: total ownership model. Source: increment.com/on-call/who-owns-on-call
  10. Important Roles (Incident Commander, Communications Officer, Scribe, Subject Matter Expert). Incident Commander: responsible for managing the incident response process and providing direction to the responder teams.
  11. Important Roles (Incident Commander, Communications Officer, Scribe, Subject Matter Expert). Communications Officer: responsible for handling communications with the stakeholders and responders.
  12. Important Roles (Incident Commander, Communications Officer, Scribe, Subject Matter Expert). Scribe: responsible for documenting information related to the incident and its response process.
  13. Important Roles (Incident Commander, Communications Officer, Scribe, Subject Matter Expert). Subject Matter Expert: technical domain experts who support the incident commander in incident resolution.
  14. Tracing: this is a powerful way of observing what’s happening in a microservice or an entire system. Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray.
  15. Alerting on SLOs: Target Error Rate ≥ SLO Threshold. The SLO has a 99.9% target over 30 days. Create an alert if the error rate is >= 0.1% over 10 min. (The Site Reliability Workbook)
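
The alert rule on the slide above (error rate at or above 0.1% over 10 minutes) could look roughly like the following Python sketch. The class name `ErrorRateWindow`, its method names, and the in-memory event queue are hypothetical; a real setup would normally express this rule in a monitoring system rather than application code.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks request outcomes over a sliding window and checks an error-rate threshold."""

    def __init__(self, window_seconds=600, max_error_rate=0.001):
        self.window_seconds = window_seconds  # 10 minutes, as on the slide
        self.max_error_rate = max_error_rate  # 0.1%, i.e. 1 - 99.9% target
        self.events = deque()                 # (timestamp_seconds, is_error)

    def record(self, timestamp, is_error):
        # assumption: timestamps arrive in non-decreasing order
        self.events.append((timestamp, is_error))
        # drop events that have fallen out of the window
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def error_rate(self):
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def should_alert(self):
        """Target error rate >= SLO threshold."""
        return self.error_rate() >= self.max_error_rate
```

A caller would invoke `record` for each request and check `should_alert` periodically. The threshold is passed as an error rate rather than derived as `1 - 0.999`, which avoids a floating-point rounding surprise at the exact boundary.
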
  16. Scaling alerts in microservices. Group: incidents are often made of more than one alert. Use similar alerting parameters: say no to toil and cognitive load that doesn’t scale. Suppress noise: noisy alerts can be a big nightmare for on-call teams.
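
As a minimal sketch of the "group" and "suppress noise" points above, the Python function below folds a stream of alerts into one incident per service and drops repeats of the same alert. The alert shape (a dict with `service` and `name` keys) and the grouping key are assumptions chosen for illustration; real alerting tools offer richer grouping and deduplication rules.

```python
def group_alerts(alerts):
    """Group alerts into one incident per service, suppressing repeated alerts."""
    incidents = {}
    for alert in alerts:
        # one incident (a list of distinct alert names) per service
        names = incidents.setdefault(alert["service"], [])
        if alert["name"] not in names:  # suppress repeats of the same alert
            names.append(alert["name"])
    return incidents


alerts = [
    {"service": "checkout", "name": "high_latency"},
    {"service": "checkout", "name": "high_latency"},
    {"service": "checkout", "name": "error_rate"},
    {"service": "search", "name": "error_rate"},
]
incidents = group_alerts(alerts)
```

Here four alerts collapse into two incidents, so the on-call engineer sees one page for `checkout` and one for `search` instead of four.
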
  17. "Production ready" means you can detect and fix problems early enough to keep your customer happy.
  18. Useful Links: (1) Atlassian Incident Handbook, https://www.atlassian.com/software/jira/ops/handbook; (2) Site Reliability Engineering / The Site Reliability Workbook, https://landing.google.com/sre/book.html; (3) Increment issue 1: On-call, https://increment.com/on-call/