Slide 1

Slide 1 text

SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Incident Management in Microservices

Slide 2

Slide 2 text

@srhtcn

Slide 3

Slide 3 text

@srhtcn We do our best to be fast and robust: Continuous Integration and Delivery, Automation, Observability, Chaos Engineering

Slide 4

Slide 4 text

@srhtcn

Slide 5

Slide 5 text

@srhtcn

Slide 6

Slide 6 text

@srhtcn Some Netflix creations: Simian Army, Chaos Monkey, Atlas, Vector…

Slide 7

Slide 7 text

@srhtcn It is more profitable to focus on speeding recovery than preventing accidents. (From The Site Reliability Workbook)

Slide 8

Slide 8 text

Major Incident Management (diagram, axis: Time). Stages shown: Plan, Train, Assemble, Declare, Detect, Resolve, Assess, Execute, Analyse

Slide 9

Slide 9 text

@srhtcn Agenda: Service Level Objectives, On-Call Ownership, Incident Response Roles, Tracing, Alerting

Slide 10

Slide 10 text

Service Level Objectives: Set the right reliability targets

Slide 11

Slide 11 text

@srhtcn

Slide 12

Slide 12 text

@srhtcn A good Service Level Objective: 99.9% of HTTP calls will be successfully completed in < 100 ms

Slide 13

Slide 13 text

@srhtcn A good Service Level Objective: 99.9% of HTTP calls will be successfully completed in < 100 ms. Focuses on user happiness.

Slide 14

Slide 14 text

@srhtcn A good Service Level Objective: 99.9% of HTTP calls will be successfully completed in < 100 ms. Focuses on user happiness. Accepts that 100% is the wrong target.

Slide 15

Slide 15 text

@srhtcn A good Service Level Objective: 99.9% of HTTP calls will be successfully completed in < 100 ms. Focuses on user happiness. Accepts that 100% is the wrong target. Has an indicator (SLI), ideally a ratio of two numbers.

Slide 16

Slide 16 text

@srhtcn A good Service Level Objective: 99.9% of HTTP calls will be successfully completed in < 100 ms. Focuses on user happiness. Accepts that 100% is the wrong target. Has an indicator (SLI), ideally a ratio of two numbers. Needs continuous improvement.
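
The "ratio of two numbers" idea can be sketched in a few lines of Python. This is only an illustration, not something from the talk: the `Slo` class, the `is_good` helper, and the choice to treat any status below 500 as "successful" are all assumptions made for the example.

```python
from dataclasses import dataclass


def is_good(status_code: int, latency_ms: float) -> bool:
    # Assumption: "successfully completed" means no server error (status < 500),
    # and the call must also finish in under 100 ms.
    return status_code < 500 and latency_ms < 100


@dataclass
class Slo:
    """An SLO whose indicator (SLI) is the ratio good events / total events."""

    target: float  # 0.999 stands for "99.9% of HTTP calls"

    def sli(self, good: int, total: int) -> float:
        # With no traffic there is nothing to violate, so report a perfect ratio.
        return good / total if total else 1.0

    def is_met(self, good: int, total: int) -> bool:
        return self.sli(good, total) >= self.target

    def error_budget_remaining(self, good: int, total: int) -> float:
        """Fraction of the allowed bad events not yet used (negative if overspent)."""
        allowed_bad = (1 - self.target) * total
        if allowed_bad == 0:
            return 1.0
        return 1 - (total - good) / allowed_bad
```

Because the target is below 100%, there is always a non-zero number of bad events the service is allowed to have; `error_budget_remaining` reports how much of that allowance is left.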

Slide 17

Slide 17 text

On-Call Ownership: Who owns on-call?

Slide 18

Slide 18 text

Two-pizza teams should wash their own dishes and take out their own trash.

Slide 19

Slide 19 text

@srhtcn Put developers on-call

Slide 20

Slide 20 text

@srhtcn Put developers on-call. Increasing demands: maintain high availability, performance and security within more complex systems.

Slide 21

Slide 21 text

@srhtcn Put developers on-call. Increasing demands: maintain high availability, performance and security within more complex systems. Dev - Ops: better alignment of development and operations.

Slide 22

Slide 22 text

@srhtcn Put developers on-call. Increasing demands: maintain high availability, performance and security within more complex systems. Dev - Ops: better alignment of development and operations. Management - Dev: better alignment of management and development.

Slide 23

Slide 23 text

@srhtcn Is there an industry standard? Google: SRE model. OpsGenie: Ownership model. Amazon: Total ownership model. Source: increment.com/on-call/who-owns-on-call

Slide 24

Slide 24 text

@srhtcn

Slide 25

Slide 25 text

Incident Response Roles: Commanding major incidents

Slide 26

Slide 26 text

Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert. Incident Commander: responsible for managing the incident response process and providing direction to the responder teams.

Slide 27

Slide 27 text

Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert. Communications Officer: responsible for handling communications with the stakeholders and responders.

Slide 28

Slide 28 text

Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert. Scribe: responsible for documenting information related to the incident and its response process.

Slide 29

Slide 29 text

Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert. Subject Matter Expert: technical domain experts who support the incident commander in incident resolution.

Slide 30

Slide 30 text

Tracing: Focus on simple and easy debugging

Slide 31

Slide 31 text

@srhtcn Tracing: a powerful way of observing what’s happening in a microservice or an entire system. Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray

Slide 32

Slide 32 text

@srhtcn

Slide 33

Slide 33 text

Tracing: Manual, Automated, Traces, Service Map

Slide 34

Slide 34 text

Tracing: Manual, Automated, Traces, Service Map

Slide 35

Slide 35 text

Tracing: Manual, Automated, Traces, Service Map

Slide 36

Slide 36 text

Tracing: Manual, Automated, Traces, Service Map

Slide 37

Slide 37 text

Alerting: Page or Ticket?

Slide 38

Slide 38 text

@srhtcn Alerting on SLOs. Target Error Rate ≥ SLO Threshold: the SLO has a 99.9% target over 30 days. Create an alert if the error rate is ≥ 0.1% over 10 minutes. Source: The Site Reliability Workbook
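
A minimal sketch of this rule, assuming requests arrive as (timestamp, is_error) events and that a sliding 10-minute window is an acceptable reading of "over 10 minutes". The `ErrorRateAlert` class and its method names are hypothetical, made up for this illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last window_s seconds reaches max_error_rate."""

    def __init__(self, max_error_rate=0.001, window_s=600):
        # 0.001 is the 0.1% of requests that a 99.9% SLO allows to fail.
        self.max_error_rate = max_error_rate
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, is_error), oldest first
        self.errors = 0

    def record(self, timestamp_s, is_error):
        self.events.append((timestamp_s, bool(is_error)))
        self.errors += int(bool(is_error))
        self._evict(timestamp_s)

    def _evict(self, now_s):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now_s - self.window_s:
            _, was_error = self.events.popleft()
            self.errors -= int(was_error)

    def should_alert(self, now_s):
        self._evict(now_s)
        if not self.events:
            return False  # no traffic in the window, nothing to alert on
        return self.errors / len(self.events) >= self.max_error_rate
```

Timestamps are plain seconds and must be recorded in non-decreasing order; the caller decides whether a firing alert becomes a page or a ticket.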

Slide 39

Slide 39 text

@srhtcn Severity levels to the rescue Source: https://www.atlassian.com/software/jira/ops/handbook/responding-to-an-incident

Slide 40

Slide 40 text

@srhtcn Scaling alerts in microservices. Group: incidents are often made of more than one alert. Use similar alerting parameters: say no to toil and cognitive load that doesn’t scale. Suppress noise: noisy alerts can be a big nightmare for on-call teams.
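
Grouping and noise suppression can be sketched as follows. The alert shape (dicts with `service` and `name` keys) and the rule of one incident per service are assumptions chosen for the example; real alerting tools offer richer grouping keys.

```python
def group_alerts(alerts):
    """Fold a stream of alerts into one incident per service.

    Repeats of the same alert name are counted instead of being raised again,
    so the on-call engineer sees one incident rather than many notifications.
    """
    incidents = {}
    for alert in alerts:
        incident = incidents.setdefault(
            alert["service"], {"service": alert["service"], "alerts": {}}
        )
        counts = incident["alerts"]
        counts[alert["name"]] = counts.get(alert["name"], 0) + 1
    return list(incidents.values())
```

For example, two `high-latency` alerts and one `error-rate` alert from a hypothetical `payments` service, plus one `high-latency` alert from `search`, produce two incidents instead of four separate notifications.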

Slide 41

Slide 41 text

"Production ready" means you can detect and fix problems early enough to keep your customer happy.

Slide 42

Slide 42 text

Useful Links
1. Atlassian Incident Handbook https://www.atlassian.com/software/jira/ops/handbook
2. Site Reliability Engineering / The Site Reliability Workbook https://landing.google.com/sre/book.html
3. Increment issue 1: On-call https://increment.com/on-call/

Slide 43

Slide 43 text

@srhtcn Recommended Reads

Slide 44

Slide 44 text

SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN Thank you!