Slide 1

Slide 1 text

Disaster Recovery: What, Why and How. Manish Pandit, Silicon Valley Code Camp, 2018

Slide 6

Slide 6 text

Why Define Disaster Recovery in a business and technical context without boiling the ocean. In other words, this is a very, very high-level overview of a topic where each slide can easily be a session on its own.

Slide 7

Slide 7 text

About Me Manish Pandit Sr. Director of Engineering at Marqeta @lobster1234 lobster1234.github.io

Slide 8

Slide 8 text

Sorry for the....math :(

Slide 9

Slide 9 text

Availability A measure of % of time a service is in a usable state. Also measured in 9s. Scheduled downtimes do not count towards availability, but may impact customer satisfaction metrics (more so in a B2C model).
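
To make "measured in 9s" concrete, the sketch below converts a number of nines into a yearly downtime budget. It assumes a 365-day year, and the function names are illustrative rather than from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_budget_minutes(nines: int) -> float:
    """Unscheduled downtime allowed per year, in minutes."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in range(2, 6):
    print(f"{n} nines: {availability_from_nines(n):.5%} available, "
          f"{downtime_budget_minutes(n):.1f} min/year of downtime")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from three to four nines is usually far more expensive than the jump from two to three.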

Slide 11

Slide 11 text

Uptime Often interchangeable with Availability Gotcha: Uptime does not mean much if the server cannot serve requests

Slide 12

Slide 12 text

Reliability A measure of the probability of the service being in a usable state for a period of time. Mean Time to Failure (MTTF) Mean Time to Repair (MTTR) Mean Time between Failures (MTBF) Mostly used for hardware such as Network/IO controllers, power supplies, etc.

Slide 13

Slide 13 text

Reliability “A rack switch goes unresponsive for 28 mins every day” MTTF = 23 hours 32 minutes MTTR = 28 minutes MTBF = 24 hours (MTTF + MTTR)
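
The rack switch numbers can be checked with a few lines of Python. The availability figure at the end uses the standard steady-state relation, availability = MTTF / (MTTF + MTTR), which is not stated on the slide and is included here only as a hedged illustration.

```python
from datetime import timedelta

mttr = timedelta(minutes=28)            # from the slide: unresponsive 28 min/day
mttf = timedelta(hours=23, minutes=32)  # from the slide
mtbf = mttf + mttr                      # the slide's definition: MTTF + MTTR

# Dividing one timedelta by another yields a plain float.
availability = mttf / mtbf
print(f"MTBF = {mtbf}, availability = {availability:.2%}")
```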

Slide 14

Slide 14 text

Disasters

Slide 15

Slide 15 text

BCP Business Continuity Plan “Business continuity planning (or business continuity and resiliency planning) is the process of creating systems of prevention and recovery to deal with potential threats to a company.” - Wikipedia Usually owned and managed by the COO

Slide 16

Slide 16 text

Disaster Recovery Disaster Recovery starts where High Availability stops.

Slide 17

Slide 17 text

Disaster Recovery Disaster Recovery is a component of BCP, covering the technical/infrastructure aspects. Usually owned and managed by the CTO/CIO.

Slide 18

Slide 18 text

But... how do we put metrics around a Disaster Recovery Plan?

Slide 19

Slide 19 text

RPO Recovery Point Objective The maximum amount of data loss that is tolerable without significant impact to business continuity. Always defined backwards in time. Ideal value = 0

Slide 20

Slide 20 text

RPO If the RPO is 4 hours, you must have (good) backups of data no older than 4 hours. Think about your laptop. How far back in time could you roll back and still tolerate losing everything created since then?

Slide 21

Slide 21 text

RTO Recovery Time Objective Wider than RPO - Covers more than just data. The maximum amount of time the system can remain unavailable without significant impact to business continuity. Ideal value = 0

Slide 22

Slide 22 text

Source: CloudAcademy

Slide 23

Slide 23 text

RTO and RPO If it takes 2 hours to restore the last backup that was done 4 hours ago, then RTO is >= 2 hours, and RPO is >= 4 hours. If a master fails, and the slave is 10 minutes behind, your RPO cannot be < 10 minutes. If the application needs to be bounced to update the db connections which takes 10 minutes, then the RTO cannot be < 10 minutes.
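
The lower bounds above can be sketched as two small functions. This is a minimal illustration under two assumptions: any listed copy of the data is usable for recovery, and recovery steps run one after another. The function names are hypothetical.

```python
from datetime import timedelta


def rpo_floor(copy_ages):
    """Best achievable RPO: the age of the freshest usable copy of the data."""
    return min(copy_ages)


def rto_floor(recovery_steps):
    """Best achievable RTO: sequential recovery steps add up."""
    return sum(recovery_steps, timedelta())


# Backup-only scenario from the slide: last backup is 4 hours old, restore takes 2 hours.
print(rpo_floor([timedelta(hours=4)]), rto_floor([timedelta(hours=2)]))

# Replica scenario from the slide: replica is 10 minutes behind, app bounce takes 10 minutes.
print(rpo_floor([timedelta(minutes=10)]), rto_floor([timedelta(minutes=10)]))
```

Comparing these floors against the targets set by the business shows whether the current setup can meet them at all, before any tuning.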

Slide 24

Slide 24 text

PTO* Paid Time Off following the disaster recovery. *It is more or less a convention to throw PTO in there.

Slide 25

Slide 25 text

Who decides RTO and RPO? The business does.

Slide 26

Slide 26 text

That’s easy - get me zero RTO and RPO Zero RTO and/or RPO is realistically impossible. (why?) The business has to establish the tolerable RTO and RPO. This acts as a requirements-spec for the DR Plan and Implementation. These limits also help establish the SLA with customers.

Slide 27

Slide 27 text

Tolerable? For a bank, an RPO greater than a few minutes = lost transactions. For an online broker, an RTO greater than a few minutes = lost trades. For a media company, an RTO greater than a few minutes = angry tweets. For a static website, weekly backups are acceptable with an RPO of 1 week. For an HR system, an RPO greater than a day may be acceptable, but an RTO greater than a few hours may not.

Slide 28

Slide 28 text

Common Failures Network backbone/ISP Outage Software Bugs Storage Controller/NFS Crashes Disruptive changes to security settings/firewalls Corrupt DNS configuration being replicated AWS/Public Cloud Outage

Slide 29

Slide 29 text

Hybrid Cloud Most companies run a hybrid cloud, which means the infrastructure is split (usually disproportionately) between on-prem and public cloud.

Slide 30

Slide 30 text

Backup & Restore Regular backups are copied to the recovery site. Infrastructure has to be spun up on the recovery site in the event of a disaster. RPO and RTO can be in hours, if not days. Inexpensive - Costs a few hundred dollars a month for the storage.

Slide 31

Slide 31 text

Pilot Light Data is replicated asynchronously to the failover site Infrastructure is provisioned, but needs to be started before taking any traffic (RTO!) Data replication may be a few seconds/minutes behind (RPO!) Lower RTO and RPO than Backup & Restore, a bit more $$ for replication.

Slide 32

Slide 32 text

Warm Standby Scaled down infrastructure is provisioned, running, ready to take on traffic. May need to be scaled up to handle full production load (Autoscale!) Data replication may be a few seconds/minutes behind (RPO!) Lower RTO than Pilot Light, more $$ (why?)

Slide 33

Slide 33 text

Multi-Site Multiple sites taking live production traffic. Difficult to pull off due to database constraints (multi-master, anyone?). When done right, RPO and RTO of a few seconds to a few minutes. Costs an arm and a leg.

Slide 34

Slide 34 text

Multi Cloud Mother of them all. Automation to support multiple cloud providers, plus on-prem. RPO and RTO similar to multi-site, but provides isolation at a provider level. Costs an arm, a leg, and a kidney.

Slide 35

Slide 35 text

Fail Back Reverse the data flow Freeze the DR site Route traffic to primary site Unfreeze the DR site
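
The fail back sequence is order-sensitive, so it helps to encode it as an ordered list that halts on the first failure. The sketch below is only an illustration: every action is a placeholder lambda standing in for real tooling, and `run_failback` is a hypothetical name.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failback(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first one that reports failure."""
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"fail back halted at step: {name}")
        completed.append(name)
    return completed


# Placeholder actions: each lambda stands in for real tooling and just succeeds.
FAILBACK_STEPS: List[Step] = [
    ("reverse the data flow", lambda: True),
    ("freeze the DR site", lambda: True),
    ("route traffic to primary site", lambda: True),
    ("unfreeze the DR site", lambda: True),
]

print(run_failback(FAILBACK_STEPS))
```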

Slide 36

Slide 36 text

So...

Slide 37

Slide 37 text

Survey the Land Start with measuring your current RTO and RPO.

Slide 38

Slide 38 text

Gather data You cannot improve what you cannot measure. Bonus - Detect anomalies across the board.

Slide 39

Slide 39 text

Runbooks Write them, and keep them updated.

Slide 40

Slide 40 text

Review your automation Automate the infrastructure build-out (Infrastructure as Code). Follow the pull-request model for infrastructure changes. Automating a destructive script (unintentionally) is the quickest way to a disaster. foreach ($env == ‘prod’); sudo chmod -R -rx
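
One common guard against unintentionally automating a destructive script is to default to a dry run and require explicit confirmation for production. The sketch below is an assumption-laden illustration, not the talk's tooling: `plan_command`, `confirm_env`, and the example command are all hypothetical, and nothing is actually executed.

```python
import shlex
from typing import List, Optional


def plan_command(command: List[str], env: str, dry_run: bool = True,
                 confirm_env: Optional[str] = None) -> str:
    """Describe what would run; refuse prod unless explicitly confirmed.

    Nothing is executed here: the function only decides and describes.
    """
    rendered = shlex.join(command)
    if dry_run:
        return f"[dry-run:{env}] {rendered}"
    if env == "prod" and confirm_env != "prod":
        raise PermissionError("refusing to touch prod without confirm_env='prod'")
    return f"[execute:{env}] {rendered}"


# Hypothetical destructive command, used only as an example.
print(plan_command(["rm", "-rf", "./build"], env="prod"))
```

The design choice is that the safe behavior needs no arguments, while the dangerous one needs two deliberate ones.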

Slide 41

Slide 41 text

Practice the DR Plan!

Slide 43

Slide 43 text

Failure-as-a-service Inject failures in the infrastructure. Measure of readiness. Chaos Engineering. Netflix - Simian Army Amazon Aurora Fault Injection Queries
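
At its smallest scale, failure injection can be a wrapper that makes a call fail some fraction of the time. The sketch below is a generic illustration and does not use the Simian Army or Aurora fault injection APIs. `with_fault_injection` and `InjectedFailure` are hypothetical names.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised on purpose to simulate a dependency failing."""


def with_fault_injection(func: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> Callable[[], T]:
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper() -> T:
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return func()
    return wrapper


rng = random.Random(42)  # seeded so a drill is repeatable
flaky = with_fault_injection(lambda: "ok", failure_rate=0.2, rng=rng)
failures = 0
for _ in range(1000):
    try:
        flaky()
    except InjectedFailure:
        failures += 1
print(f"{failures} of 1000 calls failed")
```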

Slide 44

Slide 44 text

Not all components are equal - neither should their DRs

Slide 45

Slide 45 text

Blast Radius A DNS failure can take down an entire data center. A faulty switch can take down an entire subnet. A service failure can take down all others dependent on it. A Region failure has a larger blast radius than an Availability Zone failure. A Provider failure has a larger blast radius than a Region failure.
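
Blast radius can be estimated from a dependency map by walking it in reverse: start at the failed component and collect everything that depends on it, directly or transitively. The service names below are hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each key depends on the components listed.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["payments", "catalog"],
    "payments": ["database", "dns"],
    "catalog": ["database"],
    "database": ["dns"],
    "dns": [],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: component -> things that depend on it.
    dependents: Dict[str, List[str]] = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk over the inverted map.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(blast_radius("dns", DEPENDS_ON)))
print(sorted(blast_radius("catalog", DEPENDS_ON)))
```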

Slide 46

Slide 46 text

Design for Fault Tolerance and Graceful Degradation Prefer evented over synchronous processing wherever possible Always assume failure In the cloud, there are no edge cases
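
"Always assume failure" often translates into code as a fallback path that returns a degraded but useful response. A minimal sketch, with hypothetical service and function names:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the primary path; on any failure, serve the degraded fallback."""
    try:
        return primary()
    except Exception:
        return fallback()


def recommendations_service():
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendations service unavailable")


def static_bestsellers():
    # Degraded but still useful response.
    return ["bestseller-1", "bestseller-2"]


print(with_fallback(recommendations_service, static_bestsellers))
```

In a real system the fallback would usually also log the failure, so that degraded operation is visible on the dashboards rather than silent.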

Slide 47

Slide 47 text

Dashboards - Internal and External Service health monitoring is critical... so is ensuring that the monitors themselves can survive a disaster.

Slide 48

Slide 48 text

Finally Make disaster recovery and high availability a topic of discussion during every stage of a project. Ask the hard questions. Embrace failure - learn from it.

Slide 49

Slide 49 text

We’re hiring!

Slide 51

Slide 51 text

Thank you! Manish Pandit Sr. Director of Engineering at Marqeta @lobster1234 lobster1234.github.io