Service Level Objectives
Set the right reliability targets
Slide 11
Slide 11 text
@srhtcn
Slide 12
Slide 12 text
A good Service Level Objective:
99.9% of HTTP calls will be successfully completed in < 100 ms
Slide 13
Slide 13 text
A good Service Level Objective:
99.9% of HTTP calls will be successfully completed in < 100 ms
Focuses on user happiness
Slide 14
Slide 14 text
A good Service Level Objective:
99.9% of HTTP calls will be successfully completed in < 100 ms
Focuses on user happiness
Accepts that 100% is the wrong target
Slide 15
Slide 15 text
A good Service Level Objective:
99.9% of HTTP calls will be successfully completed in < 100 ms
Focuses on user happiness
Accepts that 100% is the wrong target
Has an indicator (SLI), ideally a ratio of two numbers
Slide 16
Slide 16 text
A good Service Level Objective:
99.9% of HTTP calls will be successfully completed in < 100 ms
Focuses on user happiness
Accepts that 100% is the wrong target
Has an indicator (SLI), ideally a ratio of two numbers
Needs continuous improvement
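To make the "ratio of two numbers" idea concrete, here is a minimal sketch of how the example objective could be checked. The names Slo, sli and meets_slo are hypothetical, and treating an empty window as meeting the objective is an assumption, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float      # required share of good requests, e.g. 0.999 for 99.9%
    latency_ms: float  # a request must finish faster than this to count as good


def sli(requests, slo):
    """Return good requests / total requests.

    `requests` is an iterable of (succeeded, latency_ms) pairs.
    """
    total = 0
    good = 0
    for succeeded, latency_ms in requests:
        total += 1
        if succeeded and latency_ms < slo.latency_ms:
            good += 1
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good / total


def meets_slo(requests, slo):
    """True when the measured indicator is at or above the target."""
    return sli(requests, slo) >= slo.target


# Example: 999 fast successes and one slow success out of 1000 calls.
http_slo = Slo(target=0.999, latency_ms=100.0)
sample = [(True, 20.0)] * 999 + [(True, 250.0)]
print(sli(sample, http_slo), meets_slo(sample, http_slo))
```

A slow but successful call counts against the indicator here, which matches the wording "successfully completed in < 100 ms".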
Slide 17
Slide 17 text
On-Call Ownership
Who owns on-call?
Slide 18
Slide 18 text
Two-pizza teams should wash the dishes and take out the trash by themselves.
Slide 19
Slide 19 text
Put developers on-call
Slide 20
Slide 20 text
Put developers on-call
Increasing demands: Maintain high availability, performance and security within more complex systems
Slide 21
Slide 21 text
Put developers on-call
Increasing demands: Maintain high availability, performance and security within more complex systems
Dev - Ops: Better alignment of development and operations
Slide 22
Slide 22 text
Put developers on-call
Increasing demands: Maintain high availability, performance and security within more complex systems
Dev - Ops: Better alignment of development and operations
Management - Dev: Better alignment of management and development
Slide 23
Slide 23 text
Is there an industry standard?
Google: SRE model
OpsGenie: Ownership model
Amazon: Total ownership model
Source: increment.com/on-call/who-owns-on-call
Slide 24
Slide 24 text
Slide 25
Slide 25 text
Incident Response Roles
Commanding major incidents
Slide 26
Slide 26 text
Incident Commander
Responsible for managing the incident response process and providing direction to the responder teams.
Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
Slide 27
Slide 27 text
Communications Officer
Responsible for handling communications with the stakeholders and responders.
Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
Slide 28
Slide 28 text
Scribe
Responsible for documenting information related to the incident and its response process.
Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
Slide 29
Slide 29 text
Subject Matter Expert
Technical domain experts who support the incident commander in incident resolution.
Important Roles: Incident Commander, Communications Officer, Scribe, Subject Matter Expert
Slide 30
Slide 30 text
Tracing
Focus on simple and easy debugging
Slide 31
Slide 31 text
Tracing
This is a powerful way of observing what’s happening in a microservice or an entire system.
Take a look at: OpenTracing, Zipkin, Jaeger, OpenCensus, AWS X-Ray
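As a rough illustration of what a trace records, the sketch below builds a toy tracer using only the Python standard library. It is not the API of OpenTracing, Zipkin, Jaeger, OpenCensus or AWS X-Ray; the Tracer class, its span method and the record fields are all hypothetical, and the tracer is single-threaded for simplicity.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: collects finished spans in memory (not thread-safe)."""

    def __init__(self):
        self.finished = []  # completed span records, in finish order
        self._stack = []    # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = {
            "name": name,
            "span_id": uuid.uuid4().hex[:16],
            # Child spans share the trace id of their parent.
            "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
            "parent_id": parent["span_id"] if parent else None,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration_s"] = time.monotonic() - record["start"]
            self.finished.append(record)


# Manual instrumentation: wrap the work you want to see in the trace.
tracer = Tracer()
with tracer.span("GET /checkout") as root:
    with tracer.span("charge-card") as child:
        pass  # the call to a downstream service would go here

for s in tracer.finished:
    print(s["name"], s["parent_id"], round(s["duration_s"], 6))
```

The parent and trace ids are what let a tracing backend stitch spans from several services into one request timeline.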
Slide 32
Slide 32 text
Slide 33
Slide 33 text
Tracing
Manual
Automated
Traces
Service Map
Slide 34
Slide 34 text
Tracing
Manual
Automated
Traces
Service Map
Slide 35
Slide 35 text
Tracing
Manual
Automated
Traces
Service Map
Slide 36
Slide 36 text
Tracing
Manual
Automated
Traces
Service Map
Slide 37
Slide 37 text
Alerting
Page or Ticket?
Slide 38
Slide 38 text
Alerting on SLOs
Target Error Rate ≥ SLO Threshold
The SLO has a 99.9% target over 30 days. Create an alert if the error rate is ≥ 0.1% over the previous 10 minutes.
Source: The Site Reliability Workbook
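The rule above can be sketched as a sliding-window check. This is only an illustration under stated assumptions: the ErrorRateAlert class and its method names are hypothetical, timestamps are plain seconds that arrive in non-decreasing order, and the 0.1% threshold is passed in directly rather than derived from the 99.9% target.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last window is at or above a threshold."""

    def __init__(self, threshold=0.001, window_seconds=600):
        self.threshold = threshold            # 0.1% expressed as a fraction
        self.window_seconds = window_seconds  # 10 minutes
        self._events = deque()                # (timestamp, is_error), oldest first
        self._errors = 0                      # errors currently inside the window

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        self._errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)

    def error_rate(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        return self._errors / len(self._events)

    def should_alert(self, now):
        return self.error_rate(now) >= self.threshold


# Example: 999 successful calls followed by one failure within 100 seconds.
alert = ErrorRateAlert()
for i in range(999):
    alert.record(i * 0.1, False)
alert.record(100.0, True)
print(alert.error_rate(100.0), alert.should_alert(100.0))
```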
Slide 39
Slide 39 text
Severity levels to the rescue
Source: https://www.atlassian.com/software/jira/ops/handbook/responding-to-an-incident
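One way to answer "page or ticket?" with severity levels is a small routing function. The sketch below is hypothetical: the slide does not spell out the levels, so the three-level scale and the cutoff (page for the two most severe levels, ticket otherwise) are assumptions chosen for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed three-level scale; lower number means more severe.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def route(severity):
    """Return "page" for urgent incidents and "ticket" for the rest.

    The SEV2 cutoff is an assumption; a team would pick its own.
    """
    return "page" if severity <= Severity.SEV2 else "ticket"


for level in Severity:
    print(level.name, route(level))
```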
Slide 40
Slide 40 text
Scaling alerts in microservices
Group: Incidents are often made of more than one alert.
Use similar alerting parameters: Say no to toil and cognitive load that doesn’t scale.
Suppress noise: Noisy alerts can be a big nightmare for on-call teams.
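Grouping and noise suppression can be sketched in a few lines. The alert shape (dictionaries with service, name and time keys), the function names and the five-minute quiet period below are assumptions made for this example, not part of any alerting product.

```python
from collections import defaultdict


def suppress_repeats(alerts, quiet_seconds=300):
    """Drop an alert if the same (service, name) was kept less than quiet_seconds ago."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        fingerprint = (alert["service"], alert["name"])
        previous = last_kept.get(fingerprint)
        if previous is not None and alert["time"] - previous < quiet_seconds:
            continue  # repeat inside the quiet period: treat as noise
        last_kept[fingerprint] = alert["time"]
        kept.append(alert)
    return kept


def group_alerts(alerts):
    """Group alerts by service so one incident can carry several alerts."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert)
    return dict(incidents)


raw = [
    {"service": "checkout", "name": "HighErrorRate", "time": 0},
    {"service": "checkout", "name": "HighErrorRate", "time": 60},
    {"service": "checkout", "name": "HighLatency", "time": 90},
    {"service": "search", "name": "HighErrorRate", "time": 120},
    {"service": "checkout", "name": "HighErrorRate", "time": 400},
]
incidents = group_alerts(suppress_repeats(raw))
for service, items in incidents.items():
    print(service, [a["name"] for a in items])
```

Using the same fingerprint and quiet period for every service is one reading of "use similar alerting parameters": the rules stay the same as the number of services grows.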
Slide 41
Slide 41 text
"Production ready" means you can detect and fix problems early enough to keep your customer happy.
Slide 42
Slide 42 text
Useful Links
1. Atlassian Incident Handbook
https://www.atlassian.com/software/jira/ops/handbook
2. Site Reliability Engineering / The Site Reliability Workbook
https://landing.google.com/sre/book.html
3. Increment issue 1: On-call
https://increment.com/on-call/
Slide 43
Slide 43 text
Recommended Reads
Slide 44
Slide 44 text
SERHAT CAN | TECHNICAL EVANGELIST | @SRHTCN
Thank you!