Handling round-the-clock availability

Investigating Cloud Deployment Archetypes

Posedio

March 14, 2024

Transcript

  1. CHALLENGE: What is the SLA for our clients? Can you improve it?

    • Millions of clients
    • No dedicated operations team
    • 24/7 uptime and availability
    • High performance / low latency requirements
    • 1 month to improve

    Downtime: 8.5 min daily, > 1 hr weekly
    Diagram: zonal resources with SLAs of 99.95%, 99.5% and 99.95%; overall 99.4%
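The figures on this slide are consistent with three components chained in series, where the overall availability is the product of the individual SLAs. The following is a minimal sketch of that arithmetic; reading the three percentages as a serial chain is an assumption, and the function names are illustrative only.

```python
from math import prod

MINUTES_PER_DAY = 24 * 60


def serial_availability(component_slas):
    """Availability of a chain in which every component must be up."""
    return prod(component_slas)


def downtime_minutes(availability, period_minutes):
    """Expected downtime within a period for a given availability."""
    return (1.0 - availability) * period_minutes


# SLA figures from the slide, read as three components in series (assumption).
zonal_chain = [0.9995, 0.995, 0.9995]
overall = serial_availability(zonal_chain)
daily = downtime_minutes(overall, MINUTES_PER_DAY)
weekly = downtime_minutes(overall, 7 * MINUTES_PER_DAY)

print(f"overall availability: {overall:.4%}")
print(f"downtime per day: {daily:.1f} min, per week: {weekly:.0f} min")
```

The weakest component (99.5%) dominates the result, which is why the later archetypes focus on making that tier redundant.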
  2. SLO, SLA, SLI, ERROR BUDGETS: SLOs (Service Level Objectives)

    • Availability is one KPI to define success
    • A numerical target that can be measured
    • E.g. ratio of successful to failed service requests

    Improving SLO comes at a cost
    • Deployment frequency
    • Cost of new features
    • SLO should be lower than actual availability

    https://sre.google/sre-book/embracing-risk/
  3. SLA (Service Level Agreement)

    Usually defined as a promise to someone using your service to meet a certain level of availability over a certain time period (SLO). Penalty if you do not meet your SLAs.

    Consequences:
    • Must be monitored
    • Clients must not exceed service usage (e.g. enforce quotas)
    • SLAs should be lower than SLOs

    https://sre.google/sre-book/embracing-risk/
  4. SLIs (Service Level Indicators)

    • Measurement of service behaviour over time
    • Allows for corrective measures if SLOs are not met

    https://sre.google/sre-book/embracing-risk/
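To make the relationship between SLI and SLO concrete, here is a small sketch that aggregates request counts over several measurement windows into one availability SLI and compares it with a target. It assumes the SLI is measured as successful requests over total requests; the `Window` type, the sample counts and the 99.9% target are hypothetical and not taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed in one measurement interval (hypothetical)."""
    successful: int
    total: int


def availability_sli(windows):
    """Fraction of successful requests across all windows."""
    total = sum(w.total for w in windows)
    if total == 0:
        return 1.0  # no traffic: treat the service as available
    return sum(w.successful for w in windows) / total


SLO_TARGET = 0.999  # hypothetical objective, not from the slides

windows = [
    Window(successful=9_990, total=10_000),
    Window(successful=19_995, total=20_000),
    Window(successful=4_999, total=5_000),
]
sli = availability_sli(windows)
slo_met = sli >= SLO_TARGET

print(f"SLI: {sli:.4%}, SLO met: {slo_met}")
```

Summing counts before dividing weights each window by its traffic, so a quiet interval with a few failures does not distort the overall figure.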
  5. ERROR BUDGET

    • Pain tolerance of users
    • Amount of errors that a service can accumulate over a period of time
    • Often business decisions
    • Can and should be planned, e.g. maintenance windows, block new releases when the error budget is spent
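The idea of blocking releases once the error budget is spent can be sketched as follows. This is only an illustration under the assumption that the budget is counted in failed requests per period; the function names and the numbers in the example are hypothetical.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Budget left after the failures observed so far (may be negative)."""
    return error_budget(slo, total_requests) - failed_requests


def release_allowed(slo, total_requests, failed_requests):
    """Gate new releases: only ship while some error budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Hypothetical period: 1,000,000 requests against a 99.9% SLO.
for failed in (400, 1_200):
    left = budget_remaining(0.999, 1_000_000, failed)
    verdict = "release" if release_allowed(0.999, 1_000_000, failed) else "freeze"
    print(f"{failed} failures, {left:.0f} budget left: {verdict}")
```

Planned maintenance windows can be booked against the same budget by counting the requests expected to fail during the window as already spent.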
  6. RELIABLE SOFTWARE: In the cloud age

    [Diagram comparing Classic IT and Cloud. Classic IT: reliability inherited from base (Infrastructure & Operations); apps run on data center infrastructure (network, compute, storage). Cloud: software improves availability (Development); apps run on IaaS and platform.]
  7. TERMINOLOGY

    • Archetype: abstract model of your architecture, lens of availability
    • Architecture: product and service design, e.g. K8s, service mesh, DBs
    • Apps and services: your code, constantly changing
  8. 1) SINGLE ZONE OR FAILOVER ZONE

    • Deploy all services to two zones in one region (failover)
    • Database with read replica
    • Can survive zone failure
    • Cost: 2x
    • Complexity: low
    • Will not defend against region failure
  9. 2) MULTI ZONAL / SINGLE REGION

    • Active/Active/Active
    • Preconditions for services (e.g. statelessness)
    • Deploy to three zones
    • HA DB (e.g. AlloyDB)
    • Improves failops (D- Failover only)
    • Still cannot handle region failure
    • Cost: 1.5x services, 2x data
  10. 4) MULTI REGIONAL ISOLATED STACK

    • Deploy all services to all zones in both regions
    • E.g. isolated user bases for regulated industries
    • DNS points at two regional LBs
    • Can survive zone and regional failures for half of the user base (failover possible)
    • No failops (except for optional regional failover)
    • Cost: 3x services, 4x data
    • Apps have to handle multi-region data
  11. 5) GLOBAL STACK

    • Global consumer services
    • Global databases / services are more expensive
    • Survives zone and regional failures
    • No failops
  12. RESULT

    Diagram: component SLAs of 99.99%, 99.5%, 99.5% and 99.99%

    0.9999 * (1 - 0.005^2) * 0.9999 = 0.999775

    99.98%, or 8 min monthly, 1 h 44 min yearly
    Measured downtime: 7 min in 2 years!
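The formula on this slide can be read as two redundant 99.5% components placed between two 99.99% components in series: the redundant pair only fails when both copies fail at once. The sketch below reproduces that calculation; the interpretation of the topology is an assumption based on the formula, and the function names are illustrative.

```python
from math import prod


def redundant_availability(availability, copies):
    """Availability of N independent copies where one healthy copy suffices."""
    return 1.0 - (1.0 - availability) ** copies


def serial_availability(availabilities):
    """Availability of a chain in which every stage must be up."""
    return prod(availabilities)


# Middle tier: two 99.5% copies in parallel (assumed independent failures).
middle_tier = redundant_availability(0.995, copies=2)

# Full path: 99.99% stage, redundant middle tier, 99.99% stage.
overall = serial_availability([0.9999, middle_tier, 0.9999])

print(f"middle tier: {middle_tier:.6f}")
print(f"overall:     {overall:.6f}")
```

The calculation assumes that the two copies fail independently, which is the reason for spreading them across zones.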
  13. WHAT WE LEARNED

    • Consider available archetypes from the start as a baseline. Switching can be hard!
    • Improving SLAs is challenging.
    • Independently of infrastructure SLAs, it is still more likely that you fuck up than Google!