Incident Management in Microservices

Serhat Can
October 18, 2019


Microservices require a shift in mindset and tooling. As monitoring shifts toward distributed tracing to meet these changing needs, incident response must also adapt. In this presentation, I’ll answer the “why” and “how” of on-call and incident response in microservices.

Transcript

  1. SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Incident Management 


    in Microservices
  2. @srhtcn

  3. @srhtcn Continuous Integration and Delivery Automation Observability We do our

    best to be fast and robust Chaos Engineering
  4. @srhtcn

  5. @srhtcn

  6. @srhtcn Some Netflix creations: Simian Army, Chaos Monkey, Atlas, Vector…

  7. @srhtcn It is more profitable to focus on speeding recovery

    than preventing accidents. FROM THE SITE RELIABILITY WORKBOOK
  8. Plan MAJOR INCIDENT MANAGEMENT Train Assemble TIME Declare Detect Resolve

    Assess Execute Analyse
  9. @srhtcn Agenda Service Level Objectives On-Call Ownership Incident Response Roles

    Tracing Alerting
  10. Service Level Objectives Set the right 
 reliability targets

  11. @srhtcn

  12. @srhtcn A good Service Level Objective: 99.9% of HTTP calls

    will be successfully completed in < 100 ms
  13. @srhtcn A good Service Level Objective: 99.9% of HTTP calls

    will be successfully completed in < 100 ms Focuses on user happiness
  14. @srhtcn A good Service Level Objective: Focuses on user happiness

    Accepts 100% is the wrong target 99.9% of HTTP calls will be successfully completed in < 100 ms
  15. @srhtcn A good Service Level Objective: Has an indicator (SLI),

    ideally a ratio of two numbers 99.9% of HTTP calls will be successfully completed in < 100 ms Focuses on user happiness Accepts 100% is the wrong target
  16. @srhtcn A good Service Level Objective: Has an indicator (SLI),

    ideally a ratio of two numbers Needs continuous improvement 99.9% of HTTP calls will be successfully completed in < 100 ms Focuses on user happiness Accepts 100% is the wrong target
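The SLI-as-a-ratio idea on the slides above can be sketched in a few lines. This is a minimal illustration only; the event counts and the `sli` helper are hypothetical, not part of any particular monitoring product.

```python
# Minimal sketch: an SLI expressed as a ratio of two numbers (good
# events over total events), checked against a 99.9% SLO target.
# The counts below are hypothetical placeholders.

def sli(good_events: int, total_events: int) -> float:
    """Fraction of HTTP calls that completed successfully in < 100 ms."""
    if total_events == 0:
        return 1.0  # no traffic: treat the target as met
    return good_events / total_events

SLO_TARGET = 0.999  # 99.9% of HTTP calls

current_sli = sli(good_events=99_950, total_events=100_000)
print(f"SLI = {current_sli:.4f}, meets SLO: {current_sli >= SLO_TARGET}")
```

Because 100% is the wrong target, the interesting quantity is how far the SLI sits above (or below) the target, not whether it equals 1.0.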
  17. On-Call Ownership Who owns on-call?

  18. Two-pizza teams should wash the dishes and take out the

    trash themselves.
  19. @srhtcn Put developers on-call

  20. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Put developers on-call
  21. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Dev - Ops Better alignment of development and operations Put developers on-call
  22. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Dev - Ops Better alignment of development and operations Management - Dev Better alignment of management and development Put developers on-call
  23. @srhtcn Google SRE model Is there an industry standard? OpsGenie

    Ownership model Amazon Total ownership model Source: increment.com/on-call/who-owns-on-call
  24. @srhtcn

  25. Incident Response Roles Commanding major incidents

  26. Incident Commander Responsible for managing the incident response process and

    providing direction to the responder teams. Important Roles Incident Commander Communications Officer Scribe Subject Matter Expert
  27. Communications Officer Responsible for handling communications with the stakeholders and

    responders. Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  28. Scribe Responsible for documenting information related to incident and its

    response process.
 Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  29. Subject Matter Expert Technical domain experts who support the incident

    commander in incident resolution. Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  30. Tracing Focus on simple and easy debugging

  31. @srhtcn Tracing This is a powerful way of observing what’s

    happening in a microservice or an entire system Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray
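The core idea behind the tools named on the slide can be shown with a toy tracer: every unit of work is recorded as a timed span, and nested spans share a trace id so one request can be followed across services. This is an illustrative sketch only; the span and field names are invented, and real systems would use OpenTracing, Jaeger, Zipkin, or similar rather than this hand-rolled version.

```python
# Toy tracer: spans sharing a trace id, collected in memory.
# Real tracers ship spans to a backend (Jaeger, Zipkin, X-Ray, ...).
import time
import uuid
from contextlib import contextmanager

spans = []  # finished spans; a real tracer would export these

@contextmanager
def span(name: str, trace_id: str):
    start = time.time()
    try:
        yield
    finally:
        spans.append({"trace": trace_id, "name": name,
                      "duration_ms": (time.time() - start) * 1000})

trace_id = uuid.uuid4().hex
with span("checkout", trace_id):          # root span for the request
    with span("charge-card", trace_id):   # call to payment service
        pass
    with span("reserve-stock", trace_id): # call to inventory service
        pass

for s in spans:
    print(s["trace"][:8], s["name"], f'{s["duration_ms"]:.2f} ms')
```

Because the inner spans close first, the root span is recorded last; a tracing UI reassembles the tree from the shared trace id and span timings.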
  32. @srhtcn

  33. Tracing Manual Automated Traces Service Map

  34. Tracing Manual Automated Traces Service Map

  35. Tracing Manual Automated Traces Service Map

  36. Tracing Manual Automated Traces Service Map

  37. Alerting Page or Ticket?

  38. @srhtcn Alerting on SLOs Target Error Rate ≥ SLO Threshold

    The SLO has a 99.9% target over 30 days. Create an alert if the error rate ≥ 0.1% over 10 min. The Site Reliability Workbook
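The alerting rule from the slide can be sketched directly: with a 99.9% SLO, the error budget is 0.1%, and a page fires when the error rate over the window crosses it. A minimal sketch, assuming hypothetical request counts:

```python
# Sketch of alerting on an SLO: page when the error rate over a short
# window (e.g. 10 min) meets or exceeds the error budget.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1%

def should_page(errors: int, requests: int) -> bool:
    """True if the windowed error rate >= the error budget."""
    if requests == 0:
        return False  # no traffic, nothing to page on
    return errors / requests >= ERROR_BUDGET

print(should_page(errors=5, requests=10_000))   # 0.05%: below budget
print(should_page(errors=20, requests=10_000))  # 0.20%: page
```

Alerting on the error rate rather than on raw error counts keeps the rule stable as traffic grows or shrinks.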
  39. @srhtcn Severity levels to the rescue Source: https://www.atlassian.com/software/jira/ops/handbook/responding-to-an-incident

  40. @srhtcn Group Incidents are often made of more than one

    alert. Scaling alerts in microservices Use similar alerting parameters Say no to toil and cognitive load that doesn’t scale Suppress noise Noisy alerts can be a big nightmare for on-call teams.
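Grouping as described on the slide can be sketched with a simple grouping key: raw alerts that share a key (here service and check name) collapse into one incident, so a burst of related pages becomes a single actionable item. The alert fields are invented for the example; real tools group on configurable keys.

```python
# Illustrative alert grouping: alerts with the same (service, check)
# key are collapsed into one incident.
from collections import defaultdict

alerts = [
    {"service": "payments", "check": "latency", "host": "pay-1"},
    {"service": "payments", "check": "latency", "host": "pay-2"},
    {"service": "payments", "check": "latency", "host": "pay-3"},
    {"service": "search", "check": "error-rate", "host": "srch-1"},
]

incidents = defaultdict(list)
for alert in alerts:
    incidents[(alert["service"], alert["check"])].append(alert)

for key, grouped in incidents.items():
    print(f"incident {key}: {len(grouped)} alert(s)")
```

Four noisy alerts become two incidents; the per-host detail is preserved inside each group rather than paging three times for one problem.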
  41. "Production ready" means you can detect and fix problems 


    early enough to keep your customer happy.
  42. Useful Links 1. Atlassian Incident Handbook https://www.atlassian.com/software/jira/ops/handbook 2. Site Reliability Engineering

    / The Site Reliability Workbook https://landing.google.com/sre/book.html 3. Increment issue 1: On-call https://increment.com/on-call/
  43. @srhtcn Recommended Reads

  44. SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Thank you!