
Disaster Recovery and Reliability

As companies move towards microservices and cloud-based deployments, multiple points of failure emerge in the infrastructure. With critical uptime demands and tight SLAs, high availability and reliability become just as important and visible. In this talk I'll cover the what, why, and how of Disaster Recovery.
Disaster Recovery is usually an afterthought, but with good blueprints and practices it can and should be part of the SDLC, allowing us to build mission-critical services that can survive disasters without degrading the quality of service. With the PaaS offerings from public cloud platforms, DR has become a lot more cost-effective and easier to implement. I will talk about key terms and metrics used for DR, and strategies that apply to any type of infrastructure: on-prem, cloud, or hybrid.
After this talk, you will be familiar with Disaster Recovery fundamentals and can start thinking about the options and strategies available for your SLA while you architect your service. Terms like hot-cold and active-active will become much more relatable as well!
I'll briefly refer to AWS and data center infrastructure during this talk, so basic familiarity with these will be helpful.

Manish Pandit

March 26, 2018


  1. Why: Define and contextualize Disaster Recovery in a business and technical context without boiling the ocean. In other words, this is a very high-level overview of a topic where each slide could easily be a session of its own.
  2. Availability: The percentage of time a service is in a usable state, often expressed in nines (99.9% is three nines). Scheduled downtime does not count against availability, but may affect customer satisfaction metrics (more so in a B2C model).
  3. Reliability: The probability that the service stays in a usable state over a period of time. Measured as MTBF (Mean Time Between Failures) and the failure rate.
  4. Connecting Reliability & Availability: A database goes down for an hour of unscheduled maintenance. Over a 24-hour window, Availability = 23/24 ≈ 95.8% (one nine) and Reliability = MTBF = 23 hours, since I can rely on that db for only 23 hours.
  5. BCP (Business Continuity Plan): “Business continuity planning (or business continuity and resiliency planning) is the process of creating systems of prevention and recovery to deal with potential threats to a company.” - Wikipedia. Usually owned and managed by the COO.
  6. Disaster Recovery: A component of BCP, covering the technical/infrastructure area. Usually owned and managed by the CTO/CIO.
  7. RPO (Recovery Point Objective): The maximum amount of data loss that is tolerable without significant impact to business continuity. Always defined backwards in time. Ideal value = 0.
  8. RPO: If the RPO is 4 hours, you must have (good) backups of data no older than 4 hours. Think about your laptop: how far back in time could you roll back and still tolerate losing everything since then?
  9. RTO (Recovery Time Objective): Wider than RPO, since it covers more than just data. The maximum amount of time the system can remain unavailable without significant impact to business continuity. Ideal value = 0.
  10. RTO and RPO: If it takes 2 hours to restore the last backup, which was taken 4 hours ago, then RTO is >= 2 hours and RPO is >= 4 hours. If a master fails and the slave is 10 minutes behind, your RPO cannot be < 10 minutes. If the application needs to be bounced to update its db connections, and that takes 10 minutes, then the RTO cannot be < 10 minutes.
  11. PTO: Paid Time Off following the disaster recovery. *It is more or less a convention to throw PTO in there.
  12. That’s easy - get me zero RTO and RPO: Zero RTO and/or RPO is realistically impossible. (why?) The business has to establish the tolerable RTO and RPO. These act as the requirements spec for the DR plan and implementation. These limits also help establish the SLA with customers.
  13. Tolerable? For a bank, an RPO greater than a few minutes = lost transactions. For an online broker, an RTO greater than a few minutes = lost trades. For a media company, an RTO greater than a few minutes = angry tweets. For a static website, weekly backups are acceptable, with an RPO of 1 week. For an HR system, an RPO greater than a day may be acceptable, but an RTO greater than a few hours may not.
  14. Hybrid Cloud: Most companies run a hybrid cloud, which means the infrastructure is split (usually disproportionately) between on-prem and public cloud.
  15. Common Failures: Network backbone/ISP outage. Software bugs. Storage controller/NFS crashes. Disruptive changes to security settings/firewalls. Corrupt DNS configuration being replicated. AWS/public cloud outage.
  16. Backup & Restore: Regular backups are copied to the recovery site. Infrastructure has to be spun up on the recovery site in the event of a disaster. RPO and RTO can be in hours, if not days. Inexpensive: costs a few hundred dollars a month for the storage.
  17. Pilot Light: Infrastructure is provisioned, but needs to be started before taking any traffic (RTO!). Data replication may be a few seconds/minutes behind (RPO!). Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.
  18. Warm Standby: Infrastructure is provisioned and ready to take on traffic. It may need to be scaled up to handle full production load. Data replication may be a few seconds/minutes behind (RPO!). Lower RTO than Pilot Light, more $$ (why?).
  19. Multi-Site: Multiple sites taking live production traffic. Difficult to pull off due to database constraints (multi-master, anyone?). When done right, RPO and RTO of a few seconds to a few minutes. Costs an arm and a leg.
  20. Multi Cloud: Mother of them all. Automation to support multiple cloud providers, plus on-prem. RPO and RTO similar to Multi-Site, but provides isolation at a provider level. Costs an arm, a leg, and a kidney.
  21. Review your automation: Follow the pull-request model for infrastructure changes. Automating a destructive script (unintentionally) is the quickest way to a disaster, e.g. foreach ($env == ‘prod’); sudo chmod -R -rx
  22. Failure-as-a-service: Inject failures in the infrastructure. A measure of readiness. Chaos Engineering. Examples: Netflix's Simian Army, Amazon Aurora failure injections.
  23. Blast Radius: A DNS failure can take down an entire data center. A faulty switch can take down an entire subnet. A service failure can take down all others dependent on it. A Region failure has a larger blast radius than an Availability Zone failure. A Provider failure has a larger blast radius than a Region failure.
  24. Dashboards - Internal and External: Service health monitoring is critical, and so is ensuring that the monitors themselves can survive a disaster.
  25. Finally: Make disaster recovery and high availability a topic of discussion during every stage of a project. Ask the hard questions. Embrace failure - learn from it.
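
To make the arithmetic behind the Availability and Reliability slides concrete, here is a minimal Python sketch. It assumes a 24-hour observation window for the database example and treats MTBF as operating time divided by the number of failures. The function names are illustrative and do not come from the talk.

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Fraction of the observation window during which the service was usable."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours


def nines(avail: float, max_nines: int = 9) -> int:
    """Count how many nines an availability figure reaches (0.95 -> 1, 0.9995 -> 3)."""
    count = 0
    for n in range(1, max_nines + 1):
        if avail >= 1 - 10 ** -n:
            count = n
        else:
            break
    return count


def mtbf(operating_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operating time divided by number of failures."""
    if failures <= 0:
        raise ValueError("failures must be a positive integer")
    return operating_hours / failures


if __name__ == "__main__":
    # Assumed scenario: one unscheduled one-hour outage in a 24-hour window.
    a = availability(uptime_hours=23, total_hours=24)
    print(f"availability = {a:.1%}, nines = {nines(a)}")
    print(f"MTBF = {mtbf(operating_hours=23, failures=1)} hours")
```

Under these assumptions, one hour of downtime in a day lands at one nine, and the single failure gives an MTBF equal to the 23 hours of operation.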
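
The RTO and RPO slide reasons in lower bounds: data cannot be fresher than the freshest copy you hold, and recovery cannot finish before every recovery step has run. The sketch below encodes that reasoning. It assumes recovery uses the freshest available copy and that the recovery steps run one after another. All names are hypothetical.

```python
from typing import Sequence


def rpo_floor_minutes(copy_staleness_minutes: Sequence[float]) -> float:
    """Best achievable RPO: the staleness of the freshest copy (backup or replica)."""
    if not copy_staleness_minutes:
        raise ValueError("at least one backup or replica is required")
    return min(copy_staleness_minutes)


def rto_floor_minutes(recovery_step_minutes: Sequence[float]) -> float:
    """Best achievable RTO, assuming the recovery steps run sequentially."""
    return sum(recovery_step_minutes)


def meets_objectives(rpo_target: float, rto_target: float,
                     copy_staleness_minutes: Sequence[float],
                     recovery_step_minutes: Sequence[float]) -> bool:
    """True if the setup can satisfy both the RPO and RTO targets (in minutes)."""
    return (rpo_floor_minutes(copy_staleness_minutes) <= rpo_target
            and rto_floor_minutes(recovery_step_minutes) <= rto_target)


if __name__ == "__main__":
    # Slide example: last backup taken 4 hours ago, 2 hours to restore it.
    print(rpo_floor_minutes([240]))  # 240
    print(rto_floor_minutes([120]))  # 120
    # Replica 10 minutes behind, plus a 10-minute application bounce.
    print(meets_objectives(rpo_target=15, rto_target=15,
                           copy_staleness_minutes=[240, 10],
                           recovery_step_minutes=[10]))  # True
```

A check like this can sit next to the business-defined targets, so that a change in backup frequency or restore time that breaks the objectives is noticed early.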
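
The Blast Radius slide can be read as a graph question: if one component fails, which components depend on it, directly or transitively? The following sketch answers that with a breadth-first search over a dependency map. The service names and the graph are made up for illustration.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(depends_on: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every component that transitively depends on `failed`."""
    # Invert the edges: for each component, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            if component != failed and component not in affected:
                affected.add(component)
                queue.append(component)
    return affected


if __name__ == "__main__":
    # Hypothetical topology: each key depends on the components in its set.
    graph = {
        "web": {"api"},
        "api": {"db", "dns"},
        "db": {"dns"},
        "dns": set(),
    }
    print(sorted(blast_radius(graph, "dns")))  # ['api', 'db', 'web']
    print(sorted(blast_radius(graph, "web")))  # []
```

In this toy topology a DNS failure reaches every other component, while a failure of the leaf service reaches none, which mirrors the ordering of blast radii on the slide.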
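
In the spirit of the Failure-as-a-service slide, here is a toy, in-process version of failure injection. It is not how Simian Army or Aurora failure injection work; it is only a sketch, assuming a hypothetical decorator that makes a call fail at a configurable rate and a helper that measures how often the call still succeeds.

```python
import random
from functools import wraps
from typing import Callable


class InjectedFailure(RuntimeError):
    """Raised in place of the real call to simulate an outage."""


def inject_failures(rate: float, rng: random.Random) -> Callable:
    """Decorator: fail the wrapped call with probability `rate` (0.0 to 1.0)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                raise InjectedFailure(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def survival_rate(fn: Callable[[], object], attempts: int) -> float:
    """Fraction of calls that complete without an injected failure."""
    successes = 0
    for _ in range(attempts):
        try:
            fn()
            successes += 1
        except InjectedFailure:
            pass
    return successes / attempts


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated

    @inject_failures(rate=0.2, rng=rng)
    def fetch_profile() -> str:
        return "ok"

    print(f"survival rate: {survival_rate(fetch_profile, 1000):.2f}")
```

Real chaos experiments target infrastructure rather than a single function, but the same idea applies: inject a known failure rate, then measure whether the system around it copes.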