Incident Management in Microservices

Serhat Can
October 18, 2019


Microservices require a shift in mindset and tooling. As monitoring shifts toward distributed tracing to meet these changing needs, incident response must also adapt. In this presentation, I’ll answer the “why” and “how” of on-call and incident response in microservices.

Transcript

  1. SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Incident Management 


    in Microservices
  2. @srhtcn

  3. @srhtcn Continuous Integration and Delivery Automation Observability We do our

    best to be fast and robust Chaos Engineering
  4. @srhtcn

  5. @srhtcn

  6. @srhtcn Some Netflix creations: Simian Army, Chaos Monkey, Atlas, Vector…

  7. @srhtcn It is more profitable to focus on speeding recovery

    than preventing accidents. FROM THE SITE RELIABILITY WORKBOOK
  8. Plan MAJOR INCIDENT MANAGEMENT Train Assemble TIME Declare Detect Resolve

    Assess Execute Analyse
  9. @srhtcn Agenda Service Level Objectives On-Call Ownership Incident Response Roles

    Tracing Alerting
  10. Service Level Objectives Set the right 
 reliability targets

  11. @srhtcn

  12. @srhtcn A good Service Level Objective: 99.9% of HTTP calls

    will be successfully completed in < 100 ms
  13. @srhtcn A good Service Level Objective: 99.9% of HTTP calls

    will be successfully completed in < 100 ms Focuses on user happiness
  14. @srhtcn A good Service Level Objective: Focuses on user happiness

    Accepts 100% is the wrong target 99.9% of HTTP calls will be successfully completed in < 100 ms
  15. @srhtcn A good Service Level Objective: Has an indicator (SLI),

    ideally a ratio of two numbers 99.9% of HTTP calls will be successfully completed in < 100 ms Focuses on user happiness Accepts 100% is the wrong target
  16. @srhtcn A good Service Level Objective: Has an indicator (SLI),

    ideally a ratio of two numbers Needs continuous improvement 99.9% of HTTP calls will be successfully completed in < 100 ms Focuses on user happiness Accepts 100% is the wrong target
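The SLI-as-a-ratio idea on the slides above can be sketched in a few lines. This is a minimal illustration only; the event counts and the `sli` helper are hypothetical, not part of any particular monitoring product.

```python
# Minimal sketch: an SLI expressed as a ratio of two numbers (good
# events over total events), checked against a 99.9% SLO target.
# The counts below are hypothetical placeholders.

def sli(good_events: int, total_events: int) -> float:
    """Fraction of HTTP calls that completed successfully in < 100 ms."""
    if total_events == 0:
        return 1.0  # no traffic: treat the target as met
    return good_events / total_events

SLO_TARGET = 0.999  # 99.9% of HTTP calls

current_sli = sli(good_events=99_950, total_events=100_000)
print(f"SLI = {current_sli:.4f}, meets SLO: {current_sli >= SLO_TARGET}")
```

Because 100% is the wrong target, the interesting quantity is how far the SLI sits above (or below) the target, not whether it equals 1.0.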
  17. On-Call Ownership Who owns on-call?

  18. Two-pizza teams should wash the dishes and take out the

    trash themselves.
  19. @srhtcn Put developers on-call

  20. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Put developers on-call
  21. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Dev - Ops Better alignment of development and operations Put developers on-call
  22. @srhtcn Increasing demands Maintain high availability, performance and security within

    more complex systems Dev - Ops Better alignment of development and operations Management - Dev Better alignment of management and development Put developers on-call
  23. @srhtcn Google SRE model Is there an industry standard? OpsGenie

    Ownership model Amazon Total ownership model Source: increment.com/on-call/who-owns-on-call
  24. @srhtcn

  25. Incident Response Roles Commanding major incidents

  26. Incident Commander Responsible for managing the incident response process and

    providing direction to the responder teams. Important Roles Incident Commander Communications Officer Scribe Subject Matter Expert
  27. Communications Officer Responsible for handling communications with the stakeholders and

    responders. Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  28. Scribe Responsible for documenting information related to incident and its

    response process.
 Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  29. Subject Matter Expert Technical domain experts who support the incident

    commander in incident resolution. Important Roles Communications Officer Scribe Subject Matter Expert Incident Commander
  30. Tracing Focus on simple and easy debugging

  31. @srhtcn Tracing This is a powerful way of observing what’s

    happening in a microservice or an entire system Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray
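The core idea behind the tools named on the slide can be shown with a toy tracer: every unit of work is recorded as a timed span, and nested spans share a trace id so one request can be followed across services. This is an illustrative sketch only; the span and field names are invented, and real systems would use OpenTracing, Jaeger, Zipkin, or similar rather than this hand-rolled version.

```python
# Toy tracer: spans sharing a trace id, collected in memory.
# Real tracers ship spans to a backend (Jaeger, Zipkin, X-Ray, ...).
import time
import uuid
from contextlib import contextmanager

spans = []  # finished spans; a real tracer would export these

@contextmanager
def span(name: str, trace_id: str):
    start = time.time()
    try:
        yield
    finally:
        spans.append({"trace": trace_id, "name": name,
                      "duration_ms": (time.time() - start) * 1000})

trace_id = uuid.uuid4().hex
with span("checkout", trace_id):          # root span for the request
    with span("charge-card", trace_id):   # call to payment service
        pass
    with span("reserve-stock", trace_id): # call to inventory service
        pass

for s in spans:
    print(s["trace"][:8], s["name"], f'{s["duration_ms"]:.2f} ms')
```

Because the inner spans close first, the root span is recorded last; a tracing UI reassembles the tree from the shared trace id and span timings.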
  32. @srhtcn

  33. Tracing Manual Automated Traces Service Map

  34. Tracing Manual Automated Traces Service Map

  35. Tracing Manual Automated Traces Service Map

  36. Tracing Manual Automated Traces Service Map

  37. Alerting Page or Ticket?

  38. @srhtcn Alerting on SLOs Target Error Rate ≥ SLO Threshold

    The SLO has a 99.9% target over 30 days. Create an alert if the error rate ≥ 0.1% over 10 min. The Site Reliability Workbook
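The alerting rule from the slide can be sketched directly: with a 99.9% SLO, the error budget is 0.1%, and a page fires when the error rate over the window crosses it. A minimal sketch, assuming hypothetical request counts:

```python
# Sketch of alerting on an SLO: page when the error rate over a short
# window (e.g. 10 min) meets or exceeds the error budget.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1%

def should_page(errors: int, requests: int) -> bool:
    """True if the windowed error rate >= the error budget."""
    if requests == 0:
        return False  # no traffic, nothing to page on
    return errors / requests >= ERROR_BUDGET

print(should_page(errors=5, requests=10_000))   # 0.05%: below budget
print(should_page(errors=20, requests=10_000))  # 0.20%: page
```

Alerting on the error rate rather than on raw error counts keeps the rule stable as traffic grows or shrinks.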
  39. @srhtcn Severity levels to the rescue Source: https://www.atlassian.com/software/jira/ops/handbook/responding-to-an-incident

  40. @srhtcn Group Incidents are often made of more than one

    alert. Scaling alerts in microservices Use similar alerting parameters Say no to toil and cognitive load that doesn’t scale Suppress noise Noisy alerts can be a big nightmare for on-call teams.
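Grouping as described on the slide can be sketched with a simple grouping key: raw alerts that share a key (here service and check name) collapse into one incident, so a burst of related pages becomes a single actionable item. The alert fields are invented for the example; real tools group on configurable keys.

```python
# Illustrative alert grouping: alerts with the same (service, check)
# key are collapsed into one incident.
from collections import defaultdict

alerts = [
    {"service": "payments", "check": "latency", "host": "pay-1"},
    {"service": "payments", "check": "latency", "host": "pay-2"},
    {"service": "payments", "check": "latency", "host": "pay-3"},
    {"service": "search", "check": "error-rate", "host": "srch-1"},
]

incidents = defaultdict(list)
for alert in alerts:
    incidents[(alert["service"], alert["check"])].append(alert)

for key, grouped in incidents.items():
    print(f"incident {key}: {len(grouped)} alert(s)")
```

Four noisy alerts become two incidents; the per-host detail is preserved inside each group rather than paging three times for one problem.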
  41. "Production ready" means you can detect and fix problems 


    early enough to keep your customer happy.
  42. Useful Links 1. Atlassian Incident Handbook https://www.atlassian.com/software/jira/ops/handbook 2. Site Reliability Engineering

    / The Site Reliability Workbook https://landing.google.com/sre/book.html 3. Increment issue 1: On-call https://increment.com/on-call/
  43. @srhtcn Recommended Reads

  44. SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Thank you!