Handling round-the-clock availability - Investigating Cloud Deployment Archetypes

Achieving high availability while minimizing operational costs requires a thorough exploration of the cloud architecture from the outset. The initial choice of a deployment archetype sets the boundaries for the achievable service level agreements (SLAs). This talk walks through the available patterns and offers guidance on selecting the strategy that best fits specific requirements.

Posedio PRO

March 14, 2024

Transcript

  1. CHALLENGE: WHAT IS THE SLA FOR OUR CLIENTS? CAN YOU IMPROVE IT?

    • Millions of clients
    • No dedicated operations team
    • 24/7 uptime and availability
    • High performance / low latency requirements
    • 1 month to improve
    Downtime: 8.5 min daily, > 1 hr weekly (99.4%)
    Diagram labels: 99.95%, 99.5%, 99.95%; Zonal Resources
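The 99.4% figure on this slide can be reproduced with a short calculation. The sketch below assumes that the three percentages on the diagram (99.95%, 99.5%, 99.95%) belong to components that must all be up at the same time, so their availabilities multiply. The function names are illustrative and not taken from the talk.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def downtime_minutes(availability, period_minutes):
    """Expected downtime in minutes over a period of the given length."""
    return (1.0 - availability) * period_minutes


# Assumption: the three diagram values are components in series.
chain = [0.9995, 0.995, 0.9995]
overall = serial_availability(chain)
daily = downtime_minutes(overall, 24 * 60)
weekly = downtime_minutes(overall, 7 * 24 * 60)

print(f"overall availability: {overall:.4%}")
print(f"downtime per day:  {daily:.1f} min")
print(f"downtime per week: {weekly:.1f} min")
```

Under this assumption the chain comes out at roughly 99.4%, which corresponds to a little over 8 minutes of downtime per day and just over one hour per week.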
  2. SLO / SLA / SLI / ERROR BUDGETS: SLOs (Service Level Objectives)

    • Availability is one KPI to define success
    • A numerical target that can be measured
    • E.g. ratio of successful to failed service requests
    Improving SLO comes at a cost:
    • Deployment frequency
    • Cost of new features
    • SLO should be lower than actual availability
    https://sre.google/sre-book/embracing-risk/
  3. SLA (Service Level Agreement)

    Usually defined as a promise to someone using your service to meet a certain level of availability over a certain time period (SLO). Penalty if you do not meet your SLAs.
    Consequences:
    • Must be monitored
    • Clients must not exceed service usage (e.g. enforce quotas)
    • SLAs should be lower than SLOs
    https://sre.google/sre-book/embracing-risk/
  4. SLIs (Service Level Indicators)

    • Measurement of service behaviour over time
    • Allows for corrective measures if SLOs are not met
    https://sre.google/sre-book/embracing-risk/
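As a rough illustration of how an SLI relates to an SLO, the sketch below computes a request-based availability SLI and compares it with a target. It assumes the common definition "successful requests divided by all requests"; the function names, the sample counts, and the handling of a period without traffic are assumptions made for this example.

```python
def availability_sli(successful, failed):
    """Fraction of requests that succeeded in the measurement window."""
    total = successful + failed
    if total == 0:
        # Assumption: a window without traffic counts as fully available.
        return 1.0
    return successful / total


def meets_slo(sli, slo):
    """True if the measured indicator reaches the objective."""
    return sli >= slo


# Hypothetical measurement window: 9,990 good and 10 failed requests.
sli = availability_sli(successful=9_990, failed=10)
print(f"SLI: {sli:.3%}")
print("meets 99.5% SLO: ", meets_slo(sli, 0.995))
print("meets 99.95% SLO:", meets_slo(sli, 0.9995))
```

A result below the SLO is the signal for the corrective measures mentioned on the slide.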
  5. ERROR BUDGET

    • Pain tolerance of users
    • Amount of errors that a service can accumulate over a period of time
    • Often business decisions
    • Can and should be planned, e.g. maintenance windows, block new releases when the error budget is spent
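The idea of blocking releases once the budget is spent can be sketched in a few lines. This example assumes a request-based budget, i.e. the number of failed requests the SLO tolerates in a period; the function names and the sample numbers are hypothetical.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Unspent part of the budget; negative once it is overspent."""
    return error_budget(slo, total_requests) - failed_requests


def release_allowed(slo, total_requests, failed_requests):
    """Gate new releases on whether any budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Hypothetical period: 1,000,000 requests against a 99.9% SLO.
slo, total = 0.999, 1_000_000
print(f"budget: {error_budget(slo, total):.0f} failed requests")
print("release with 400 failures: ", release_allowed(slo, total, 400))
print("release with 1500 failures:", release_allowed(slo, total, 1_500))
```

A time-based budget (minutes of downtime per month) works the same way, with minutes in place of request counts.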
  6. RELIABLE SOFTWARE in the cloud age

    • Reliability inherited from base infrastructure & operations
    • Development: software improves availability
    • Diagram labels: Classic IT (apps, data center infrastructure: network, compute, storage) and Cloud (apps, IaaS platform)
  7. TERMINOLOGY

    • Archetype: abstract model of your architecture, lens of availability
    • Architecture: product and service design, e.g. K8s, service mesh, DBs
    • Apps and services: your code, constantly changing
  8. 1) SINGLE ZONE OR FAILOVER ZONE

    • Deploy all services to two zones in one region (failover)
    • Database with read replica
    • Can survive zone failure
    • Cost: 2x
    • Complexity: low
    • Will not defend against region failure
  9. 2) MULTI ZONAL / SINGLE REGION

    • Active/Active/Active
    • Preconditions for services (e.g. statelessness)
    • Deploy to three zones
    • HA DB (e.g. AlloyDB)
    • Improves failops (D- Failover only)
    • Still cannot handle region failure
    • Cost: 1.5x services, 2x data
  10. 4) MULTI REGIONAL ISOLATED STACK

    • Deploy all services to all zones in both regions
    • E.g. isolated user bases for regulated industries
    • DNS points at two regional LBs
    • Can survive zone and regional failures for half of the user base (failover possible)
    • No failops (except for optional regional failover)
    • Cost: 3x services, 4x data
    • Apps have to handle multi-region data
  11. 5) GLOBAL STACK

    • Global consumer services
    • Global databases / services are more expensive
    • Survives zone and regional failures
    • No failops
  12. RESULT

    Diagram labels: 99.99%, 99.5%, 99.5%, 99.99%
    0.9999 * (1 - 0.005^2) * 0.9999 = 0.999775
    99.98%, or 8 min monthly, 1 h 44 min yearly
    Measured downtime: 7 min in 2 years!!
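The formula on this slide combines two rules: components in series multiply their availabilities, while redundant replicas fail only if all of them fail at once. The sketch below reproduces the calculation under the assumption that the two 99.5% values are redundant zonal deployments with independent failures, placed between two 99.99% components. Function names and the 30-day month / 365-day year are choices made for this example.

```python
def parallel_availability(replicas):
    """Availability of redundant replicas that fail independently."""
    all_down = 1.0
    for availability in replicas:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_minutes(availability, period_minutes):
    """Expected downtime in minutes over a period of the given length."""
    return (1.0 - availability) * period_minutes


MONTH = 30 * 24 * 60   # assumed 30-day month, in minutes
YEAR = 365 * 24 * 60   # assumed 365-day year, in minutes

zones = parallel_availability([0.995, 0.995])  # 1 - 0.005^2
result = 0.9999 * zones * 0.9999

print(f"composite availability: {result:.6f}")
for label, value in (("exact", result), ("rounded to 99.98%", 0.9998)):
    monthly = downtime_minutes(value, MONTH)
    yearly = downtime_minutes(value, YEAR)
    print(f"{label}: {monthly:.1f} min/month, {yearly:.0f} min/year")
```

With these period lengths, the downtime figures quoted on the slide are close to what the rounded 99.98% value yields; the unrounded 0.999775 gives somewhat more downtime.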
  13. WHAT WE LEARNED

    • Consider the available archetypes from the start as a baseline. Switching can be hard!
    • Improving SLAs is challenging.
    • Regardless of infrastructure SLAs, it is still more likely that you fuck up than Google!