Handling round-the-clock availability

2 CHALLENGE: WHAT IS THE SLA FOR OUR CLIENTS? CAN YOU IMPROVE IT?
• Millions of clients
• No dedicated operations team
• 24/7 uptime and availability
• High performance / low latency requirements
• 1 month to improve
Downtime: 8.5 min daily, > 1 hr weekly
[Diagram: zonal resources at 99.95%, 99.5%, and 99.95%; 99.4% overall]
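As a rough check of the downtime figures on this slide, the sketch below converts an availability percentage into expected downtime. The function name and the period constants are illustrative and not from the talk; it assumes downtime is simply the unavailable fraction of the period.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Expected downtime in minutes for an availability given as a fraction (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


# 99.4% is the overall availability quoted on the slide.
daily = downtime_minutes(0.994, MINUTES_PER_DAY)
weekly = downtime_minutes(0.994, MINUTES_PER_WEEK)
print(f"daily: {daily:.1f} min, weekly: {weekly:.1f} min")
```

At 99.4% this gives a little over 8.5 minutes per day and just over an hour per week, in line with the slide.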

3 SLO SLA SLI ERROR BUDGETS
SLOs (Service Level Objectives)
• Availability is one KPI to define success
• A numerical target that can be measured
• E.g. ratio of successful requests to all service requests
Improving SLO comes at a cost
• Deployment frequency
• Cost of new features
• SLO should be lower than actual availability
https://sre.google/sre-book/embracing-risk/
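The measurable target described on this slide can be sketched as a request success ratio compared against an SLO. This is a minimal illustration, not the speaker's tooling: it divides successful requests by total requests (the request-based availability form used in the linked SRE book chapter), and the request counts and the 99.95% target are made-up example values.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded (request-based availability)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


def meets_slo(sli: float, slo: float) -> bool:
    """True if the measured indicator is at or above the objective."""
    return sli >= slo


# Hypothetical counts for one measurement window.
sli = availability_sli(successful=999_620, total=1_000_000)
print(f"SLI: {sli:.4%}, meets 99.95% SLO: {meets_slo(sli, 0.9995)}")
```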

4 SLO SLA SLI ERROR BUDGETS
SLA (Service Level Agreement)
Usually defined as a promise to someone using your service to meet a certain level of availability (SLO) over a certain time period. Penalty if you do not meet your SLAs.
Consequences:
• Must be monitored
• Clients must not exceed service usage (e.g. enforce quotas)
• SLAs should be lower than SLOs
https://sre.google/sre-book/embracing-risk/

5 SLO SLA SLI ERROR BUDGETS
SLIs (Service Level Indicators)
• Measurement of service behaviour over time
• Allows for corrective measures if SLOs are not met
https://sre.google/sre-book/embracing-risk/

6 SLO SLA SLI ERROR BUDGETS
ERROR BUDGET
• Pain tolerance of users
• Amount of errors that a service can accumulate over a period of time
• Often a business decision
• Can and should be planned, e.g. maintenance windows, blocking new releases when the error budget is spent
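One way to make the error budget concrete is to express it in minutes of allowed downtime and gate releases on what is left. The sketch below is an assumption about how such a gate could look; the class name, the 30-day window, and the 99.95% SLO are illustrative and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo: float             # target availability as a fraction, e.g. 0.9995
    period_minutes: float  # length of the budget window in minutes

    @property
    def total_minutes(self) -> float:
        """Downtime the SLO tolerates over the whole window."""
        return (1.0 - self.slo) * self.period_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already consumed (may be negative)."""
        return self.total_minutes - downtime_minutes

    def releases_allowed(self, downtime_minutes: float) -> bool:
        """Block new releases once the budget is spent."""
        return self.remaining_minutes(downtime_minutes) > 0


budget = ErrorBudget(slo=0.9995, period_minutes=30 * 24 * 60)
print(f"budget: {budget.total_minutes:.1f} min")
print("release after 10 min of downtime:", budget.releases_allowed(10))
print("release after 25 min of downtime:", budget.releases_allowed(25))
```

Planned maintenance windows would be subtracted from the same budget, which is why the slide suggests planning them in advance.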

7 SLO SLA SLI ERROR BUDGETS

8 RELIABLE SOFTWARE in the cloud age
[Diagram comparing Classic IT and Cloud. Labels: reliability inherited from base infrastructure & operations; development and software improve availability; apps; data center infrastructure (network, compute, storage); IaaS; platform]

9 THE MATH – DEPENDENT SERVICES
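The slide content itself did not survive in the transcript. As a hedged reconstruction of the usual rule: when a request needs every service in a chain, the availabilities multiply, assuming failures are independent. The function name is illustrative; the three values are the zonal figures from the challenge slide.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of each other.
    """
    return math.prod(availabilities)


chain = [0.9995, 0.995, 0.9995]
overall = serial_availability(chain)
print(f"overall: {overall:.4%}")
```

The product is about 99.4%, lower than the weakest single component, which matches the overall figure on the challenge slide.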

10 THE MATH - REDUNDANCY
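The slide content itself did not survive in the transcript. A common way to model redundancy, shown here as a sketch, is that a redundant group is down only when all replicas are down at once. This assumes independent failures and instant failover, which real systems only approximate; the function name is illustrative.

```python
def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of N identical replicas where one healthy replica is enough.

    Assumes replica failures are independent and failover is instantaneous.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - availability) ** replicas


single = 0.995
pair = redundant_availability(single, replicas=2)
print(f"one replica: {single:.4%}, two replicas: {pair:.4%}")
```

Doubling a 99.5% component yields 1 - 0.005^2 = 99.9975%, the middle term of the formula on the result slide.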

11 TERMINOLOGY
• Archetype: abstract model of your architecture, seen through the lens of availability
• Architecture: product and service design, e.g. K8s, service mesh, DBs
• Apps and services: your code, constantly changing

12 APPLICATION – SERVICES - PRODUCTS

13 APPLICATION – SERVICES - PRODUCTS
[Diagram labels: SLO 1, SLO 2, SLO 3; risks; efforts to improve resilience]

DEPLOYMENT ARCHETYPES https://bit.ly/cloudarchetypes

15 ARCHETYPES CLASSIFICATION
[Diagram: "Deployment Archetypes for Cloud Applications", showing archetypes (1) to (5) plus Hybrid and Multi-Cloud]

16 1) SINGLE ZONE OR FAILOVER ZONE
• Deploy all services to two zones in one region (failover)
• Database with read replica
• Can survive zone failure
• Cost: 2x
• Complexity: low
• Will not defend against region failure

17 1) SINGLE ZONE OR FAILOVER ZONE

18 2) MULTI ZONAL / SINGLE REGION
• Active/Active/Active
• Preconditions for services (e.g. statelessness)
• Deploy to three zones
• HA DB (e.g. AlloyDB)
• Improves Failops (D- Failover only)
• Still cannot handle region failure
• Cost: 1.5x services, 2x data

19 2) SLA - MULTI ZONAL / SINGLE REGION

20 3) SLA - FAILOVER REGION

21 4) MULTI REGIONAL ISOLATED STACK
• Deploy all services to all zones in both regions
• E.g. isolated user bases for regulated industries
• DNS points at two regional LBs
• Can survive zone and regional failures for half of the user base (failover possible)
• No failops (except for optional regional failover)
• Cost: 3x services, 4x data
• Apps have to handle multi-region data

22 5) GLOBAL STACK
• Global consumer services
• Global databases / services are more expensive
• Survives zone and regional failures
• No failops

23 5) SLA - GLOBAL STACK

24 SO WHAT? 99.95% 99.5% 99.95%

25 RESULT
99.99% 99.5% 99.5% 99.99%
0.9999 * (1 - 0.005^2) * 0.9999 = 0.999775
99.98%, or about 8 min monthly, 1 h 44 min yearly
Measured downtime: 7 min in 2 years!!
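The formula on this slide chains two 99.99% components around a redundant pair of 99.5% components. The sketch below re-evaluates it and converts the result into downtime; variable names and the 30-day month are assumptions made for illustration, and independent failures are assumed throughout.

```python
MINUTES_PER_MONTH = 30 * 24 * 60   # assumes a 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60

front = 0.9999                      # first 99.99% component
middle = 1.0 - (1.0 - 0.995) ** 2   # two redundant 99.5% components
back = 0.9999                       # last 99.99% component

# Components in series multiply (every stage must be up).
composite = front * middle * back

monthly_downtime = (1.0 - composite) * MINUTES_PER_MONTH
yearly_downtime = (1.0 - composite) * MINUTES_PER_YEAR

print(f"composite availability: {composite:.6f}")
print(f"downtime: {monthly_downtime:.1f} min/month, {yearly_downtime:.0f} min/year")
```

The unrounded composite allows slightly more downtime than the slide states; the slide's monthly and yearly figures appear to be derived from the rounded 99.98% value.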

26 WHAT WE LEARNED
• Consider available archetypes from the start as a baseline. Switching can be hard!
• Improving SLAs is challenging.
• Independently of infrastructure SLAs, it is still more likely that you fuck up than Google!

27 REFERENCES
• https://sre.google/sre-book/service-level-objectives/
• http://bit.ly/cloudarchetypes
• https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos

Posedio