Incident Management in Microservices

Serhat Can
October 18, 2019

Microservices require a shift in mindset and tooling. Monitoring is changing, with more focus on distributed tracing to meet new needs, and incident response must adapt as well. In this presentation, I’ll answer the “why” and “how” of on-call and incident response in microservices.

Transcript

  1. 2.
  2. 4.
  3. 5.
  4. 7.

    @srhtcn "It is more profitable to focus on speeding recovery than preventing accidents."
    From The Site Reliability Workbook
  5. 11.
  6. 12.

    A good Service Level Objective (SLO):
    99.9% of HTTP calls will be successfully completed in < 100 ms.
  7. 13.

    A good Service Level Objective (SLO):
    99.9% of HTTP calls will be successfully completed in < 100 ms.
    Focuses on user happiness.
  8. 14.

    A good Service Level Objective (SLO):
    99.9% of HTTP calls will be successfully completed in < 100 ms.
    Focuses on user happiness.
    Accepts that 100% is the wrong target.
  9. 15.

    A good Service Level Objective (SLO):
    99.9% of HTTP calls will be successfully completed in < 100 ms.
    Focuses on user happiness.
    Accepts that 100% is the wrong target.
    Has an indicator (SLI), ideally a ratio of two numbers.
  10. 16.

    A good Service Level Objective (SLO):
    99.9% of HTTP calls will be successfully completed in < 100 ms.
    Focuses on user happiness.
    Accepts that 100% is the wrong target.
    Has an indicator (SLI), ideally a ratio of two numbers.
    Needs continuous improvement.
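
The "ratio of two numbers" idea from the SLO slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `HttpCall`, `sli`, and `meets_slo` are hypothetical names, a "good" call is assumed to be one that succeeded and finished in under 100 ms, and an empty sample is assumed to count as compliant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpCall:
    """One observed HTTP call (hypothetical record, not from the talk)."""
    ok: bool           # True if the call completed successfully
    latency_ms: float  # observed latency in milliseconds


def sli(calls, latency_threshold_ms=100.0):
    """Return the SLI as a ratio: good calls / all calls."""
    calls = list(calls)
    if not calls:
        return 1.0  # assumption: no traffic means nothing was violated
    good = sum(1 for c in calls if c.ok and c.latency_ms < latency_threshold_ms)
    return good / len(calls)


def meets_slo(calls, target=0.999, latency_threshold_ms=100.0):
    """True when the measured SLI is at or above the SLO target."""
    return sli(calls, latency_threshold_ms) >= target


if __name__ == "__main__":
    # 998 fast successes, one slow success, one fast failure
    sample = [HttpCall(True, 42.0)] * 998 + [HttpCall(True, 250.0), HttpCall(False, 30.0)]
    print(f"SLI = {sli(sample):.4f}, meets SLO: {meets_slo(sample)}")
```

Expressing the indicator as good events over total events keeps it between 0 and 1, so the same comparison works for any service that can count both numbers.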
  11. 21.

    Put developers on-call
    Increasing demands: Maintain high availability, performance and security within more complex systems.
    Dev - Ops: Better alignment of development and operations.
  12. 22.

    Put developers on-call
    Increasing demands: Maintain high availability, performance and security within more complex systems.
    Dev - Ops: Better alignment of development and operations.
    Management - Dev: Better alignment of management and development.
  13. 23.

    Is there an industry standard?
    Google: SRE model. OpsGenie: Ownership model. Amazon: Total ownership model.
    Source: increment.com/on-call/who-owns-on-call
  14. 24.
  15. 26.

    Incident Commander: Responsible for managing the incident response process and providing direction to the responder teams.
    Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
  16. 27.

    Communications Officer: Responsible for handling communications with the stakeholders and responders.
    Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
  17. 28.

    Scribe: Responsible for documenting information related to the incident and its response process.
    Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
  18. 29.

    Subject Matter Expert: Technical domain experts who support the incident commander in incident resolution.
    Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
  19. 31.

    Tracing: This is a powerful way of observing what’s happening in a microservice or an entire system.
    Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray
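
To make the tracing slide a little more concrete, here is a toy sketch of the core idea those tools share: every unit of work is a span, and nested spans carry the same trace ID plus a pointer to their parent. The `Tracer` class below is hypothetical and uses only the standard library; it is not the API of OpenTracing, Zipkin, Jaeger, OpenCensus, or AWS X-Ray, and it is neither thread-safe nor able to propagate context across processes.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy in-process tracer (illustrative only, not a real library API)."""

    def __init__(self):
        self.finished = []  # completed spans, in the order they finished
        self._stack = []    # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = {
            # child spans inherit the trace ID of the enclosing span
            "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
            "name": name,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration_s"] = time.monotonic() - record["start"]
            self.finished.append(record)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("checkout"):
        with tracer.span("charge-card"):
            time.sleep(0.01)  # stand-in for a downstream call
    for s in tracer.finished:
        print(s["trace_id"][:8], s["name"], s["parent_id"], round(s["duration_s"], 3))
```

In a real system the trace ID would travel between services, typically in request headers, so that spans recorded by different processes can be stitched into one trace.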
  20. 32.
  21. 38.

    Alerting on SLOs: Target Error Rate ≥ SLO Threshold
    The SLO has a 99.9% target over 30 days. Create an alert if the error rate is ≥ 0.1% over 10 minutes.
    Source: The Site Reliability Workbook
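
The alert rule on this slide can be sketched as a sliding-window check. The sketch assumes per-minute counters of errors and total requests are available; `ErrorRateAlert` and its method names are hypothetical. Exact fractions are used so that an error rate of exactly 0.1% compares equal to the threshold, which plain floating-point subtraction (`1.0 - 0.999`) would not guarantee.

```python
from collections import deque
from fractions import Fraction

SLO_TARGET = Fraction(999, 1000)   # 99.9% target from the slide
ERROR_THRESHOLD = 1 - SLO_TARGET   # 0.1% error rate
WINDOW_MINUTES = 10                # alert window from the slide


class ErrorRateAlert:
    """Keeps per-minute (errors, total) counts for the last N minutes."""

    def __init__(self, window_minutes=WINDOW_MINUTES, threshold=ERROR_THRESHOLD):
        # deque with maxlen drops the oldest minute automatically
        self.buckets = deque(maxlen=window_minutes)
        self.threshold = threshold

    def record_minute(self, errors, total):
        self.buckets.append((errors, total))

    def error_rate(self):
        errors = sum(e for e, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return Fraction(errors, total) if total else Fraction(0)

    def should_alert(self):
        # Target Error Rate >= SLO Threshold
        return self.error_rate() >= self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert()
    for _ in range(9):
        alert.record_minute(errors=0, total=1000)
    alert.record_minute(errors=10, total=1000)  # one bad minute
    print(float(alert.error_rate()), alert.should_alert())
```

The Site Reliability Workbook presents this rule as the simplest of several approaches to alerting on SLOs; it reacts quickly but can fire on short blips that consume very little of the 30-day error budget.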
  22. 40.

    Scaling alerts in microservices
    Group: Incidents are often made of more than one alert.
    Use similar alerting parameters: Say no to toil and cognitive load that doesn’t scale.
    Suppress noise: Noisy alerts can be a big nightmare for on-call teams.
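
Grouping and noise suppression can be sketched as follows. The grouping key (the service name), the five-minute window, and the names `Alert`, `Incident`, and `group_alerts` are all assumptions made for illustration; the talk does not prescribe a particular rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window_seconds=300):
    """Group alerts per service into incidents and drop repeated alert names.

    An alert joins the open incident for its service if it arrives within
    window_seconds of the previous alert for that service; otherwise it
    opens a new incident.
    """
    incidents = []
    open_incident = {}  # service -> (incident, timestamp of last alert seen)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_incident.get(alert.service)
        if current is None or alert.timestamp - current[1] > window_seconds:
            incident = Incident(alert.service)
            incidents.append(incident)
        else:
            incident = current[0]
        # Suppress noise: keep only the first occurrence of each alert name.
        if all(a.name != alert.name for a in incident.alerts):
            incident.alerts.append(alert)
        open_incident[alert.service] = (incident, alert.timestamp)
    return incidents


if __name__ == "__main__":
    raw = [
        Alert("checkout", "high_latency", 0),
        Alert("checkout", "high_latency", 30),   # repeat, suppressed
        Alert("checkout", "error_rate", 60),     # same incident
        Alert("search", "error_rate", 90),       # different service
        Alert("checkout", "high_latency", 1000), # outside the window
    ]
    for inc in group_alerts(raw):
        print(inc.service, [a.name for a in inc.alerts])
```

With this rule the on-call engineer is notified about three incidents instead of five separate alerts.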
  23. 41.

    "Production ready" means you can detect and fix problems early enough to keep your customer happy.
  24. 42.

    Useful Links
    1. Atlassian Incident Handbook: https://www.atlassian.com/software/jira/ops/handbook
    2. Site Reliability Engineering / The Site Reliability Workbook: https://landing.google.com/sre/book.html
    3. Increment issue 1: On-call: https://increment.com/on-call/