Handling round-the-clock availability - Investigating Cloud Deployment Archetypes

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

CHALLENGE: WHAT IS THE SLA FOR OUR CLIENTS? CAN YOU IMPROVE IT?
• Millions of clients
• No dedicated operations team
• 24/7 uptime and availability
• High performance / low latency requirements
• 1 month to improve
Downtime: 8.5 min daily, > 1 hr weekly (99.4%)
[Diagram labels: 99.95%, 99.5%, 99.95%; Zonal Resources]
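The downtime figures on this slide follow from the availability percentage. A minimal sketch of that conversion, assuming 99.4% is the availability behind the quoted downtime; `downtime_minutes` is an illustrative helper name, not something from the talk:

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Expected downtime in minutes for a given availability over a period."""
    return (1.0 - availability) * period_minutes


# Assumed availability taken from the slide (99.4%).
daily = downtime_minutes(0.994, MINUTES_PER_DAY)
weekly = downtime_minutes(0.994, MINUTES_PER_WEEK)
print(f"{daily:.1f} min/day, {weekly:.1f} min/week")
```

This gives a little over 8.5 minutes per day and just over one hour per week, in line with the slide.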

Slide 3

Slide 3 text

SLO SLA SLI ERROR BUDGETS
SLOs (Service Level Objectives)
• Availability is one KPI to define success
• A numerical target that can be measured
• E.g. the ratio of successful requests to all service requests
Improving an SLO comes at a cost
• Deployment frequency
• Cost of new features
• The SLO should be lower than the actual availability
https://sre.google/sre-book/embracing-risk/

Slide 4

Slide 4 text

SLA (Service Level Agreement)
Usually defined as a promise to the users of your service to meet a certain level of availability (an SLO) over a certain time period. There is a penalty if you do not meet your SLAs.
Consequences:
• SLAs must be monitored
• Clients must not exceed their service usage (e.g. enforce quotas)
• SLAs should be lower than SLOs
https://sre.google/sre-book/embracing-risk/

Slide 5

Slide 5 text

SLIs (Service Level Indicators)
• Measurement of service behaviour over time
• Allows for corrective measures if SLOs are not met
https://sre.google/sre-book/embracing-risk/
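An availability SLI can be sketched as the share of successful requests in a measurement window, compared against the SLO. The function names and the request counts below are hypothetical and only illustrate the idea:

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if total == 0:
        return 1.0
    return successful / total


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


# Hypothetical request counts for one measurement window.
sli = availability_sli(successful=999_412, total=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

When `meets_slo` returns False, that is the signal for the corrective measures mentioned on the slide.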

Slide 6

Slide 6 text

ERROR BUDGET
• Pain tolerance of users
• Number of errors that a service can accumulate over a period of time
• Often a business decision
• Can and should be planned, e.g. maintenance windows, or blocking new releases when the error budget is spent
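One way to make the error budget concrete is to derive the number of tolerated failures from the SLO and gate releases on what is left. This is a sketch under assumptions: the `ErrorBudget` class, its fields, and the numbers are illustrative, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests observed in the budget window
    failed_requests: int  # failures observed in the same window

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for this amount of traffic."""
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; negative once the budget is overspent."""
        return self.allowed_failures - self.failed_requests

    def releases_allowed(self) -> bool:
        """Block new releases once the budget is spent."""
        return self.remaining > 0


# Hypothetical window: 1M requests, 600 failures, 99.9% SLO.
budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=600)
print(f"remaining budget: {budget.remaining:.0f} failures, "
      f"releases allowed: {budget.releases_allowed()}")
```

Planned work such as a maintenance window can be accounted for the same way, by adding its expected failures to `failed_requests` before deciding.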

Slide 7

Slide 7 text

SLO SLA SLI ERROR BUDGETS

Slide 8

Slide 8 text

RELIABLE SOFTWARE in the cloud age
[Diagram: Classic IT vs. Cloud]
• Classic IT: reliability inherited from the base (Infrastructure & Operations); apps on data center infrastructure (network, compute, storage)
• Cloud: software improves availability (Development); apps on IaaS / platform

Slide 9

Slide 9 text

THE MATH – DEPENDENT SERVICES
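For services that depend on each other, every component in the chain has to be up, so the availabilities multiply. A minimal sketch, assuming independent failures; `serial_availability` is an illustrative name, and the example values are the three percentages shown on the Challenge slide:

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


# Percentages from the Challenge slide, read as a chain of three components.
chain = [0.9995, 0.995, 0.9995]
total = serial_availability(chain)
print(f"{total:.2%}")
```

The product lands at roughly 99.4%, which is lower than the weakest single component.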

Slide 10

Slide 10 text

THE MATH – REDUNDANCY
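With redundancy, the service is only down when all replicas are down at the same time, so the unavailabilities multiply instead. A sketch under the assumption that replicas fail independently (correlated failures make the real number worse); `redundant_availability` is an illustrative name:

```python
def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of N independent replicas where one healthy replica suffices."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - availability) ** replicas


for n in (1, 2, 3):
    print(n, f"{redundant_availability(0.995, n):.6f}")
```

Going from one to two replicas of a 99.5% component already yields about 99.9975%.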

Slide 11

Slide 11 text

TERMINOLOGY
• Archetype: abstract model of your architecture, seen through the lens of availability
• Architecture: product and service design, e.g. K8s, service mesh, DBs
• Apps and services: your code, constantly changing

Slide 12

Slide 12 text

APPLICATION – SERVICES – PRODUCTS

Slide 13

Slide 13 text

APPLICATION – SERVICES – PRODUCTS
[Diagram labels: SLO 1, SLO 2, SLO 3; Risks; Efforts to improve resilience]

Slide 14

Slide 14 text

DEPLOYMENT ARCHETYPES https://bit.ly/cloudarchetypes

Slide 15

Slide 15 text

ARCHETYPES CLASSIFICATION
[Figure: Deployment Archetypes for Cloud Applications, numbered (1) to (5), plus Hybrid and Multi-Cloud]

Slide 16

Slide 16 text

1) SINGLE ZONE OR FAILOVER ZONE
• Deploy all services to two zones in one region (failover)
• Database with read replica
• Can survive a zone failure
• Cost: 2x
• Complexity: low
• Will not defend against a region failure

Slide 17

Slide 17 text

1) SINGLE ZONE OR FAILOVER ZONE

Slide 18

Slide 18 text

2) MULTI ZONAL / SINGLE REGION
• Active/Active/Active
• Preconditions for services (e.g. statelessness)
• Deploy to three zones
• HA DB (e.g. AlloyDB)
• Improves failops (D- Failover only)
• Still cannot handle a region failure
• Cost: 1.5x services, 2x data

Slide 19

Slide 19 text

2) SLA - MULTI ZONAL / SINGLE REGION

Slide 20

Slide 20 text

3) SLA - FAILOVER REGION

Slide 21

Slide 21 text

4) MULTI REGIONAL ISOLATED STACK
• Deploy all services to all zones in both regions
• E.g. isolated user bases for regulated industries
• DNS points at two regional LBs
• Can survive zone and regional failures for half of the user base (failover possible)
• No failops (except for optional regional failover)
• Cost: 3x services, 4x data
• Apps have to handle multi-region data

Slide 22

Slide 22 text

5) GLOBAL STACK
• Global consumer services
• Global databases / services are more expensive
• Survives zone and regional failures
• No failops

Slide 23

Slide 23 text

5) SLA - GLOBAL STACK

Slide 24

Slide 24 text

SO WHAT?
99.95%, 99.5%, 99.95%

Slide 25

Slide 25 text

RESULT
99.99%, 99.5%, 99.5%, 99.99%
0.9999 * (1 - 0.005^2) * 0.9999 = 0.999775
99.98%, or 8 min monthly, 1 h 44 min yearly
Measured downtime: 7 min in 2 years!
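The result formula can be checked in a few lines. The sketch assumes the reading suggested by the formula itself: two 99.99% components in series around a redundant pair of 99.5% components. The variable names are illustrative.

```python
edge = 0.9999     # first and last component in the chain (99.99%)
replica = 0.995   # each of the two redundant components (99.5%)

# The pair is down only when both replicas are down at once.
redundant_pair = 1 - (1 - replica) ** 2

# The chain needs the first component, the pair, and the last component.
composite = edge * redundant_pair * edge
print(round(composite, 6), f"{composite:.2%}")
```

This reproduces the 0.999775 on the slide, which rounds to 99.98%.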

Slide 26

Slide 26 text

WHAT WE LEARNED
• Consider the available archetypes from the start as a baseline. Switching can be hard!
• Improving SLAs is challenging.
• Independently of infrastructure SLAs, it is still more likely that you fuck up than Google!

Slide 27

Slide 27 text

REFERENCES
• https://sre.google/sre-book/service-level-objectives/
• http://bit.ly/cloudarchetypes
• https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos