CHALLENGE
WHAT IS THE SLA FOR OUR CLIENTS? CAN YOU IMPROVE IT?
• Millions of clients
• No dedicated operations team
• 24/7 uptime and availability
• High performance / low latency requirements
• 1 month to improve
Availability: 99.4% (99.95% × 99.5% × 99.95%)
Downtime: 8.5 min daily, > 1 hr weekly
Zonal Resources
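The downtime figures follow directly from the availability percentage. A minimal sketch of that conversion, assuming a 24-hour day and a 7-day week (the function name `downtime_minutes` is illustrative, not from the deck):

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Expected downtime in minutes for an availability (0..1) over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


# 99.4% is the availability stated on the slide.
daily = downtime_minutes(0.994, MINUTES_PER_DAY)
weekly = downtime_minutes(0.994, 7 * MINUTES_PER_DAY)
print(f"daily: {daily:.1f} min, weekly: {weekly:.1f} min")
```

Under these assumptions 99.4% works out to roughly eight and a half minutes per day and just over an hour per week, in line with the slide.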
SLO / SLA / SLI / ERROR BUDGETS
SLOs (Service Level Objectives)
• Availability is one KPI to define success
• A numerical target that can be measured
• E.g. ratio of successful to total service requests
Improving SLO comes at a cost
• Deployment frequency
• Cost of new features
• SLO should be lower than actual availability
https://sre.google/sre-book/embracing-risk/
SLA (Service Level Agreement)
Usually defined as a promise to someone using your service to meet a certain level of availability (SLO) over a certain time period.
Penalty if you do not meet your SLAs.
Consequences:
• Must be monitored
• Clients must not exceed the agreed service usage (e.g. enforce quotas)
• SLAs should be lower than SLOs
https://sre.google/sre-book/embracing-risk/
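Enforcing quotas can be as simple as counting requests per client within a window. The sketch below is one hypothetical way to do it; the class `QuotaEnforcer`, its methods, and the limit of 2 are assumptions made for illustration and do not come from the deck or from any specific product.

```python
from collections import defaultdict


class QuotaEnforcer:
    """Counts requests per client and rejects those above a fixed limit."""

    def __init__(self, limit_per_window: int) -> None:
        self.limit = limit_per_window
        self.used = defaultdict(int)  # client id -> requests in this window

    def allow(self, client_id: str) -> bool:
        """Return True and count the request if the client is under quota."""
        if self.used[client_id] >= self.limit:
            return False
        self.used[client_id] += 1
        return True

    def reset_window(self) -> None:
        """Start a new window, e.g. called by a scheduler once per minute."""
        self.used.clear()


quota = QuotaEnforcer(limit_per_window=2)
decisions = [quota.allow("client-a") for _ in range(3)]
print(decisions)  # the third request exceeds the quota of 2
```

Requests rejected for exceeding the quota would normally be excluded from the SLA calculation, which is why the quota has to be part of the agreement.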
SLIs (Service Level Indicators)
Measurement of service behaviour over time
Allows for corrective measures if SLOs are not met
https://sre.google/sre-book/embracing-risk/
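As an illustration, an availability SLI can be computed as the share of successful requests and compared against the SLO. This is a minimal sketch; the request counts and the 99.9% target are made-up example values, and `availability_sli` is a hypothetical helper.

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: successful requests divided by total requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


SLO_TARGET = 0.999  # assumed target, for illustration only

sli = availability_sli(successful=999_412, total=1_000_000)
slo_met = sli >= SLO_TARGET
print(f"SLI = {sli:.4%}, SLO met: {slo_met}")
```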
ERROR BUDGET
• Pain tolerance of users
• Amount of errors that a service can accumulate over a period of time
• Often a business decision
• Can and should be planned, e.g. maintenance windows, blocking new releases when the error budget is spent
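The points above can be sketched as a small calculation: the budget is the share of requests the SLO allows to fail, and releases are blocked once it is used up. The function names, the 99.9% SLO, and the request counts are assumptions chosen for the example.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the period."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Remaining budget; negative once the budget is overspent."""
    return error_budget(slo, total_requests) - failed_requests


def release_allowed(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Policy sketch: block new releases when the error budget is spent."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


SLO = 0.999  # assumed target, for illustration only
remaining = budget_remaining(SLO, total_requests=1_000_000, failed_requests=588)
print(f"remaining budget: {remaining:.0f} failed requests")
print("release allowed:", release_allowed(SLO, 1_000_000, 588))
```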
RELIABLE SOFTWARE
In the cloud age
Reliability Inherited From Base (Infrastructure & Operations) vs. Software Improves Availability (Development)
[Figure: stack comparison. Classic IT: apps on data center infrastructure (network, compute, storage). Cloud: apps on a platform spanning multiple IaaS.]
THE MATH – DEPENDENT SERVICES
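When a request has to pass through several services in sequence, all of them must be up, so their availabilities multiply. A minimal sketch, assuming independent failures and using the three figures quoted in the challenge (99.95%, 99.5%, 99.95%); `serial_availability` is an illustrative name:

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


chain = [0.9995, 0.995, 0.9995]
combined = serial_availability(chain)
print(f"combined availability: {combined:.4%}")
```

The chain is always less available than its weakest link, which is how three components in the 99.5% to 99.95% range end up at about 99.4%.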
THE MATH - REDUNDANCY
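With redundant replicas the service is down only if every replica is down at the same time, so the unavailabilities multiply. A minimal sketch, assuming replicas fail independently (which real zones only approximate) and using the 99.5% zonal figure from the deck; `redundant_availability` is an illustrative name:

```python
def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where one suffices."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - availability) ** replicas


two_zones = redundant_availability(0.995, 2)
three_zones = redundant_availability(0.995, 3)
print(f"two zones: {two_zones:.6f}, three zones: {three_zones:.9f}")
```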
TERMINOLOGY
• Archetype: abstract model of your architecture, seen through the lens of availability
• Architecture: product and service design, e.g. K8s, Service Mesh, DBs
• Apps and Services: your code, constantly changing
1) SINGLE ZONE OR FAILOVER ZONE
• Deploy all services to two zones in one region (failover)
• Database with read replica
• Can survive zone failure
• Cost: 2x
• Complexity: low
• Will not defend against region failure
2) MULTI ZONAL / SINGLE REGION
• Active/Active/Active
• Preconditions for services (e.g. statelessness)
• Deploy to three zones
• HA DB (e.g. AlloyDB)
• Improves failops (D-Failover only)
• Still cannot handle region failure
• Cost: 1.5x services, 2x data
2) SLA - MULTI ZONAL / SINGLE REGION
3) SLA - FAILOVER REGION
4) MULTI REGIONAL ISOLATED STACK
• Deploy all services to all zones in both regions
• E.g. isolated user bases for regulated industries
• DNS points at two regional LBs
• Can survive zone and regional failures for half of the user base (failover possible)
• No failops (except for optional regional failover)
• Cost: 3x services, 4x data
• Apps have to handle multi-region data
5) GLOBAL STACK
• Global consumer services
• Global databases / services are more expensive
• Survives zone and regional failures
• No failops
5) SLA - GLOBAL STACK
SO WHAT?
99.95% 99.5% 99.95%
RESULT
99.99% 99.5% 99.5% 99.99%
0.9999 * (1 - 0.005^2) * 0.9999 = 0.999775
99.98%, or 8 min monthly, 1 h 44 min yearly
Measured downtime: 7 min in 2 years!!
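Written out step by step, the result combines the two rules: the two 99.5% components act as a redundant pair, and that pair sits in series with the two 99.99% components. The symbols $A_1$, $A_2$ (the 99.99% components) and $A_z$ (one 99.5% component) are introduced here only for the derivation, and independent failures are assumed.

```latex
\begin{align*}
A_{\text{total}} &= A_{1} \cdot \bigl(1 - (1 - A_{z})^{2}\bigr) \cdot A_{2} \\
                 &= 0.9999 \cdot (1 - 0.005^{2}) \cdot 0.9999 \\
                 &= 0.9999 \cdot 0.999975 \cdot 0.9999 \\
                 &\approx 0.999775 \approx 99.98\%
\end{align*}
```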
WHAT WE LEARNED
• Consider the available archetypes from the start as a baseline. Switching can be hard!
• Improving SLAs is challenging.
• Independently of infrastructure SLAs, it is still more likely that you fuck up than that Google does!