YOU IMPROVE IT?
• Millions of clients
• No dedicated operations team
• 24/7 uptime and availability
• High performance / low latency requirements
• 1 month to improve
Downtime: 8.5 min daily, > 1 hr weekly
[Diagram: Zonal Resources, labelled 99.4%, 99.95%, 99.5%, 99.95%]
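The downtime figures follow from the availability percentage. The sketch below is one way to reproduce them, assuming that the three values listed with the zonal resources (99.95%, 99.5%, 99.95%) are components that every request passes through in series and that fail independently. The function names are illustrative and not from the slides.

```python
from math import prod

def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of each other.
    """
    return prod(components)

def downtime_minutes(availability, period_minutes):
    """Expected downtime within a period for an availability in the range 0..1."""
    return (1.0 - availability) * period_minutes

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY

# Assumption: the three figures from the slide are serial dependencies.
composite = serial_availability([0.9995, 0.995, 0.9995])

print(f"composite availability: {composite:.2%}")
print(f"daily downtime:  {downtime_minutes(composite, MINUTES_PER_DAY):.1f} min")
print(f"weekly downtime: {downtime_minutes(composite, MINUTES_PER_WEEK):.1f} min")
```

Under these assumptions the composite comes out at roughly 99.4%, which is about 8.6 minutes of downtime per day and just over an hour per week.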
• Availability is one KPI to define success
• A numerical target that can be measured
• E.g. ratio of successful to total service requests
Improving the SLO comes at a cost
• Deployment frequency
• Cost of new features
• SLO should be lower than actual availability
https://sre.google/sre-book/embracing-risk/
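A request-based availability measurement can be sketched as follows. The request counts and the 99.95% target are hypothetical values chosen for illustration, and the helper names are not from the slides.

```python
def availability(successful, total):
    """Fraction of requests served successfully.

    An empty measurement window is treated as fully available.
    """
    if total < 0 or successful < 0 or successful > total:
        raise ValueError("need 0 <= successful <= total")
    return 1.0 if total == 0 else successful / total

def meets_slo(successful, total, slo):
    """True if the measured availability reaches the SLO target."""
    return availability(successful, total) >= slo

# Hypothetical numbers for one measurement window.
measured = availability(successful=999_620, total=1_000_000)
print(f"measured availability: {measured:.4%}")
print("SLO 99.95% met:", meets_slo(999_620, 1_000_000, slo=0.9995))
```

Measuring over requests rather than wall-clock time keeps the number meaningful for a service that is partially degraded instead of simply up or down.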
• Usually defined as a promise to someone using your service to meet a certain level of availability over a certain time period (SLO)
• Penalty if you do not meet your SLAs
Consequences:
• Must be monitored
• Clients must not exceed service usage (e.g. enforce quotas)
• SLAs should be lower than SLOs
https://sre.google/sre-book/embracing-risk/
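The ordering rule (SLA below SLO, SLO below the availability actually achieved) can be checked mechanically. This is a small sketch with hypothetical target values. `check_targets` is an illustrative helper, not part of any monitoring tool.

```python
def check_targets(sla, slo, measured):
    """Return warnings for violations of the ordering SLA < SLO < measured availability."""
    warnings = []
    if not sla < slo:
        warnings.append("SLA should be lower than the SLO")
    if not slo < measured:
        warnings.append("SLO should be lower than the measured availability")
    return warnings

# Hypothetical targets: the promise to customers leaves headroom below the internal goal.
print(check_targets(sla=0.995, slo=0.9995, measured=0.9997))  # no warnings expected
print(check_targets(sla=0.9995, slo=0.9995, measured=0.9990))
```

The gap between SLA and SLO is the safety margin: the internal target is missed before the contractual penalty applies.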
tolerance of users
• Amount of errors that a service can accumulate over a period of time
• Often a business decision
• Can and should be planned, e.g. maintenance windows; block new releases when the error budget is spent
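An error budget and the "block new releases" rule can be expressed in a few lines. This is a minimal sketch assuming a request-based budget. The class, its fields, and the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ErrorBudget:
    slo: float            # target availability, e.g. 0.9995
    total_requests: int   # requests seen in the current window
    failed_requests: int  # failed requests in the same window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates for this request volume."""
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        """Failures still available before the budget is spent (negative if overspent)."""
        return self.allowed_failures - self.failed_requests

    def releases_allowed(self) -> bool:
        """Release gate: new releases are blocked once the budget is used up."""
        return self.remaining > 0

# Hypothetical window: 2 million requests, 700 of them failed.
budget = ErrorBudget(slo=0.9995, total_requests=2_000_000, failed_requests=700)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"remaining budget: {budget.remaining:.0f}")
print("releases allowed:", budget.releases_allowed())
```

Planned work such as a maintenance window can be booked against the same budget by counting the failures it is expected to cause before it starts.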
[Diagram: what improves availability in Classic IT vs. Cloud. Base infrastructure & operations and software development; layers: data center, infrastructure (network, compute, storage), IaaS, platform, apps]
1) SINGLE ZONE OR FAILOVER ZONE
region (failover)
• Database with read replica
• Can survive zone failure
• Cost: 2x
• Complexity: low
• Will not defend against region failure
2) MULTI ZONAL / SINGLE REGION
Deploy to three zones
• HA DB (e.g. AlloyDB)
• Improves failops (DB failover only)
• Still cannot handle region failure
• Cost: 1.5x services, 2x data
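Why spreading a service over several zones helps can be estimated with the standard formula for redundant components. The sketch assumes identical zones that fail independently and a service that stays up as long as one zone is up. Real zones share some failure modes, so the result is an upper bound. The 99.5% per-zone figure is a hypothetical input.

```python
def redundant_availability(zone_availability, zones):
    """Availability when at least one of `zones` identical, independent replicas must be up."""
    if zones < 1:
        raise ValueError("need at least one zone")
    return 1.0 - (1.0 - zone_availability) ** zones

# Hypothetical per-zone availability of 99.5%.
for n in (1, 2, 3):
    print(f"{n} zone(s): {redundant_availability(0.995, n):.7%}")
```

The estimate covers zone failures only. A regional outage takes all zones down together, which is why this setup still cannot handle region failure.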
4) MULTI REGIONAL ISOLATED STACK
regions
• E.g. isolated user bases for regulated industries
• DNS points at two regional LBs
• Can survive zone and regional failures for half of the user base (failover possible)
• No failops (except for optional regional failover)
• Cost: 3x services, 4x data
• Apps have to handle multi-region data
WHAT WE LEARNED
Switching can be hard!
• Improving SLAs is challenging.
• Regardless of the infrastructure SLAs, it is still more likely that you fuck up than that Google does!