What is it?
Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations.

What's the goal?
The goal of DR methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs.

How do we do that?
We need to create a plan that addresses everything required to restore the systems, restore operations, and minimise data loss and damage to the company's image.
AWS EKS
Our goal is to find a way to keep services online, or at least reduce the downtime of the EKS production cluster, in case of an AWS Region outage.

Recovering
To do that, we need a way to restore the cluster in another region as fast as possible, using automation through infrastructure as code or another solution.
Active-Active
In this scenario we have two clusters running. We deploy to, operate, and monitor both clusters. If the primary fails, the other cluster keeps running and there is no interruption.
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging

Warm standby
We have a smaller second cluster running, with the same workloads and configurations at a reduced scale. In case of a disaster we can manually switch the LB/DNS to the second cluster and scale up.
- Some downtime (minutes to hours)
- Expensive
- Somewhat easy to manage
- Data replication can be challenging

Restore from backup
Here we need to recreate the cluster in another region (with automation) and restore everything from the backup system.
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- Restore from backup can take time
Options
> Active-Active
> Warm-standby
> Restore from backup

Solutions
For each method, we have a way to restore the business, and we can use different tools to do it. Let's see some options for each of them.
Same configurations
Same workloads
Replica of stateful volumes (complex)

[Diagram: deploy to two EKS clusters, with volume replication between them]

Pros/Cons
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging
- External LB is needed
- External DNS is needed
[Active|Active]

Smaller instances
On the second active cluster we can set smaller instances with a strong autoscaling configuration. The load balancer can send 25% of the traffic to cluster B and, in case of an outage, that cluster can scale out to fit the user needs.

Analyser tools
With SaaS analysers we can find and adjust pods and nodes, reducing cloud costs.
- cast.ai: https://cast.ai/cost-optimizer/
- spot.io
- harness.io: https://harness.io/products/cloud-cost/

Instances
Spot.io can facilitate the use of spot instances to reduce instance costs.
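The 25% traffic split described above can be reasoned about with a small model. This is only a sketch of weighted routing with failover, not the configuration of any specific load balancer; the cluster names and the 75/25 weights are assumptions chosen to match the slide.

```python
from __future__ import annotations


def traffic_shares(weights: dict[str, int], healthy: dict[str, bool]) -> dict[str, float]:
    """Split traffic by weight across healthy clusters only."""
    live = {name: w for name, w in weights.items() if healthy.get(name, False)}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy cluster left to receive traffic")
    # Unhealthy clusters get 0; healthy ones share 100% in proportion to weight.
    return {name: live.get(name, 0) / total for name in weights}


# Hypothetical cluster names; cluster-b is the smaller second cluster.
WEIGHTS = {"cluster-a": 75, "cluster-b": 25}

print(traffic_shares(WEIGHTS, {"cluster-a": True, "cluster-b": True}))
print(traffic_shares(WEIGHTS, {"cluster-a": False, "cluster-b": True}))
```

When cluster A is marked unhealthy, cluster B receives all traffic, which is the moment its autoscaling has to absorb the load.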
create RDS Cross Region Read Replicas for unencrypted MySQL and PostgreSQL database instances.

- Disaster Recovery: You can create cross-region read replicas of your primary database instance to have a disaster recovery solution. If your primary region faces a disruption, you can promote the replica to a master and keep your business operational.
- Scaling: You can use cross-region read replicas to support read queries from your workloads across various geographic locations. This will reduce latency by serving your customers from a database that is close to them.
- Cross-region Migration: If you would like to migrate your database instance quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region, and once it is ready, promote it to master and point your application to it.
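As a rough illustration of the create-then-promote flow above, the sketch below uses boto3's `create_db_instance_read_replica` and `promote_read_replica`. The ARN, identifiers, and regions are hypothetical placeholders, and boto3 is imported lazily so the request builder can be checked without AWS access.

```python
def replica_request(source_arn: str, replica_id: str) -> dict:
    """Build the request for a cross-region read replica.

    For a cross-region replica the source must be given as an ARN.
    """
    return {
        "DBInstanceIdentifier": replica_id,
        "SourceDBInstanceIdentifier": source_arn,
    }


def create_cross_region_replica(source_arn: str, replica_id: str, target_region: str) -> dict:
    import boto3  # third-party dependency, only needed when calling AWS

    # The client is created in the DR region, where the replica will live.
    rds = boto3.client("rds", region_name=target_region)
    return rds.create_db_instance_read_replica(**replica_request(source_arn, replica_id))


def promote_replica(replica_id: str, target_region: str) -> dict:
    import boto3

    # Promotion turns the replica into a standalone, writable instance.
    rds = boto3.client("rds", region_name=target_region)
    return rds.promote_read_replica(DBInstanceIdentifier=replica_id)


# Hypothetical identifiers, for illustration only.
print(replica_request("arn:aws:rds:us-east-1:123456789012:db:prod-db", "prod-db-dr"))
```

After promotion, the application still has to be pointed at the new endpoint, for example through configuration managed by GitOps.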
create S3 Cross Region Replica: every object written to the origin S3 bucket will be replicated to the destination bucket.

- Disaster Recovery: You can create cross-region replicas of your primary bucket to have a disaster recovery solution. If your primary region faces a disruption, you can point your app to the replica bucket and keep your business operational.
- Lower latency: You can use cross-region replication to provide lower-latency data access in different geographic regions.
- Cross-region Migration: If you would like to migrate your bucket quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region, and once it is ready, point your application to it.
- No extra costs: You pay Amazon S3’s usual charges for storage, requests, and inter-region data transfer for the replicated copy of data.

Cross-region replication is available in the US Standard, US West (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (Sao Paulo) regions.
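A replication rule of the kind described above might look like the following sketch, which builds the configuration passed to boto3's `put_bucket_replication`. The role ARN, bucket names, and rule ID are hypothetical, and versioning must already be enabled on both buckets for replication to work.

```python
def replication_config(role_arn: str, dest_bucket_arn: str) -> dict:
    """Replicate every object in the source bucket to one destination bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": "dr-replicate-all",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


def enable_replication(source_bucket: str, role_arn: str, dest_bucket_arn: str) -> None:
    import boto3  # third-party dependency, only needed when calling AWS

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, dest_bucket_arn),
    )


cfg = replication_config(
    "arn:aws:iam::123456789012:role/s3-replication",  # hypothetical role
    "arn:aws:s3:::my-app-assets-dr",  # hypothetical destination bucket
)
print(cfg["Rules"][0]["Destination"])
```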
can create an EBS snapshot copy across AWS regions.

Disaster Recovery: You can create cross-region copies of your EBS snapshots to have a disaster recovery solution. If your primary region faces a disruption, you can restore the copied snapshot in another region, mount the resulting volume in your cluster, and point your app to the new PV.

We need to automate the copies, restores, and remounts with IaC. AWS only offers the copy, not a replica, so out of the box it's pretty much a manual procedure.
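Since the copy and restore need to be automated, here is one possible sketch using boto3's `copy_snapshot`, the `snapshot_completed` waiter, and `create_volume`. The snapshot ID, regions, and availability zone are hypothetical; attaching the volume and creating the Kubernetes PV are left out.

```python
def copy_request(source_region: str, snapshot_id: str) -> dict:
    """Build the request for copying a snapshot into the client's own region."""
    return {
        "SourceRegion": source_region,
        "SourceSnapshotId": snapshot_id,
        "Description": f"DR copy of {snapshot_id} from {source_region}",
    }


def copy_and_restore(source_region: str, snapshot_id: str, target_region: str, target_az: str) -> str:
    import boto3  # third-party dependency, only needed when calling AWS

    # copy_snapshot is issued against the destination region.
    ec2 = boto3.client("ec2", region_name=target_region)
    new_snapshot = ec2.copy_snapshot(**copy_request(source_region, snapshot_id))["SnapshotId"]

    # Wait until the copied snapshot is usable before creating a volume from it.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[new_snapshot])
    volume = ec2.create_volume(SnapshotId=new_snapshot, AvailabilityZone=target_az)
    return volume["VolumeId"]


# Hypothetical snapshot ID and region, for illustration only.
print(copy_request("us-east-1", "snap-0123456789abcdef0"))
```

Running the copy on a schedule keeps a recent snapshot in the DR region, so only the restore step remains during an outage.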
of several services in another region. They offer scheduling, policy-based backup, tag-based backup, retention management, backup audit, lifecycle policies, data encryption, cross-account backup, access policies, and even item-level recovery for EFS.

It's a fantastic solution, but we need to adjust and configure everything in the backup region to get things running like the original region. We'll need heavy automation to have everything operational again.

Services covered
- EC2 Instances
- EBS
- RDS
- DynamoDB
- EFS
- Storage Gateway
- Neptune
- DocumentDB
- S3 (Preview)
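A scheduled backup with a copy to another region could be expressed as the plan below, in the shape accepted by boto3's `create_backup_plan` for the AWS Backup service. The plan name, vault names, schedule, and retention are assumptions for illustration, not values from the slides.

```python
def backup_plan(name: str, vault: str, dr_vault_arn: str, keep_days: int = 35) -> dict:
    """Daily backup rule that also copies each recovery point to a DR-region vault."""
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": "daily-with-dr-copy",  # hypothetical rule name
                "TargetBackupVaultName": vault,
                "ScheduleExpression": "cron(0 5 * * ? *)",  # every day at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": keep_days},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": keep_days},
                    }
                ],
            }
        ],
    }


def create_plan(plan: dict, region: str) -> dict:
    import boto3  # third-party dependency, only needed when calling AWS

    return boto3.client("backup", region_name=region).create_backup_plan(BackupPlan=plan)


plan = backup_plan(
    "eks-dr-plan",  # hypothetical plan name
    "primary-vault",  # hypothetical vault in the primary region
    "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",  # hypothetical
)
print(plan["Rules"][0]["CopyActions"])
```

Resources still have to be assigned to the plan (for example by tag), and the restore side in the DR region needs its own automation.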
a cloud migration tool, often used to migrate data from on-premises to AWS, but you can also use it to sync data between AWS storage services. DataSync offers another way to sync S3 and EFS between regions; it's simple and low cost. With AWS DataSync you can easily replicate, archive, or share application data.

Services covered
- EFS
- S3
- FSx for Windows File Server
- FSx for Lustre

Price (1 TB per month)
1,000 GB x 0.0125 USD = 12.50 USD
Total: 12.50 USD
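The price estimate above is a simple volume-times-rate calculation. The sketch below reproduces it so other volumes can be plugged in; the 0.0125 USD per GB rate is the figure from the slide and may not match current pricing.

```python
from decimal import Decimal


def datasync_transfer_cost(gb: int, usd_per_gb: Decimal = Decimal("0.0125")) -> Decimal:
    """Monthly transfer cost in USD, rounded to cents."""
    return (gb * usd_per_gb).quantize(Decimal("0.01"))


print(datasync_transfer_cost(1000))  # 12.50
```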
Different sizes
Same configurations
Same workloads (replica 0)
Restore only volumes from backup

[Diagram, Option 2: deploy to both EKS clusters; K8S Backup restores volumes to the secondary cluster]

Pros/Cons
- Some downtime
- Expensive
- Somewhat easy to manage
- Restore can take time
- External LB is needed
- External DNS is needed
- Need to scale up after switch
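The "scale up after switch" step can be scripted. The sketch below generates `kubectl scale` commands that bring deployments kept at zero replicas back to their production counts; the deployment names, replica counts, and namespace are hypothetical.

```python
from __future__ import annotations

import subprocess


def scale_up_commands(replicas: dict[str, int], namespace: str) -> list[list[str]]:
    """One `kubectl scale` command per deployment, as argv lists."""
    return [
        ["kubectl", "scale", f"deployment/{name}", f"--replicas={count}", "-n", namespace]
        for name, count in replicas.items()
    ]


def run_all(commands: list[list[str]]) -> None:
    # Requires kubectl on PATH and a kubeconfig pointing at the standby cluster.
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical workloads that sit at 0 replicas on the standby cluster.
commands = scale_up_commands({"api": 6, "worker": 4}, "production")
for argv in commands:
    print(" ".join(argv))
```

In a GitOps setup the same effect would more likely come from changing replica counts in the repository, so that the standby cluster does not drift from its declared state.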
to create a new cluster
Restore K8S backups
- Workload backups
- Volume backups

[Diagram, Option 1: Cluster Provisioner restores the EKS cluster; K8S Workloads Backup and K8S Volumes Backup are restored into it]

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- No need to reinstall everything
- Restore from backup can take time
- Need an external LB
- Need an external DNS
to create a new cluster
GitOps will recreate the configuration
Need to restore the AWS backup

[Diagram, Option 2: Cluster Provisioner restores the EKS cluster, Apps Provisioner restores the apps, and storage is restored from AWS Backup]

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- Need to replicate data across regions
- Restore from AWS backup can take time
- Need to reinstall everything
- Need an external LB
- Need an external DNS
Restore from backup: Option 1

- regions: terraform
- Deployment in both clusters: argocd or flux
- Volume management/backup: longhorn
- DNS provider: cloudflare
- Backup/Restore: velero
- helm

Longhorn
If the primary cluster is using Longhorn instead of EBS/EFS for volumes, it's easy to restore StatefulSet volumes on the new cluster.

Helm
Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.
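For the Velero backup/restore step in this option, a minimal sketch of the two CLI calls involved is shown below, using `velero backup create` and `velero restore create --from-backup`. The backup name and namespaces are hypothetical, and Velero is assumed to be installed in both clusters with a backup location reachable from the DR region.

```python
from __future__ import annotations

import subprocess


def backup_command(name: str, namespaces: list[str]) -> list[str]:
    """Back up the listed namespaces on the primary cluster."""
    return ["velero", "backup", "create", name, "--include-namespaces", ",".join(namespaces)]


def restore_command(backup_name: str) -> list[str]:
    """Restore a named backup into whichever cluster the kubeconfig points at."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


def run(argv: list[str]) -> None:
    # Requires the velero CLI on PATH.
    subprocess.run(argv, check=True)


# Hypothetical backup name and namespaces.
print(" ".join(backup_command("prod-daily", ["production", "monitoring"])))
print(" ".join(restore_command("prod-daily")))
```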
Restore from backup: Option 2

- regions: terraform
- Deployment in both clusters: argocd or flux
- Volume management/sync/backup: AWS Backup, AWS DataSync
- DNS provider: cloudflare
- helm

Helm
Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.
from backup

Automation
Using automation is the only way to reduce the downtime; it's an essential and vital part of the DR plan.

Restore Plan
The restore plan is important, and the order in which things are restored is its most important part. We need to find the fastest and most secure way to have everything up and running in the right order.

Restore Test
It's essential to test the disaster recovery plan to ensure that it is still viable and that it covers everything we need. During the test, we can improve the procedure documentation and make the necessary adjustments, always following a continuous improvement strategy.

SaaS Backup
Using an external SaaS backup can help with backups and can sometimes speed up the restore.

SaaS DR
Some companies offer the entire DR as a service.
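Because the restore order matters so much, it can help to declare the dependencies between restore steps and derive the order from them. The sketch below does this with Python's standard `graphlib`; the step names and their dependencies are hypothetical examples, not the actual plan.

```python
from __future__ import annotations

from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
# Step names and dependencies are hypothetical.
RESTORE_DEPENDENCIES = {
    "network": set(),
    "eks-cluster": {"network"},
    "storage": {"eks-cluster"},
    "databases": {"network"},
    "workloads": {"storage", "databases"},
    "dns-switch": {"workloads"},
}


def restore_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every step follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = restore_order(RESTORE_DEPENDENCIES)
print(" -> ".join(order))
```

Keeping the dependencies in one place also makes restore tests easier to review: a cycle raises `graphlib.CycleError` instead of surfacing mid-recovery.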
External DNS
An external DNS can help to switch to another cloud provider or region fast, but it can have issues with DNS zone caching and TTL.

External Load Balancer
An external load balancer can help to switch to another cloud provider or region fast and, unlike DNS, there is no gap on the client side.

External database
An external database like Atlas can be very useful because you can move your workloads without moving your data or DBMS.

External Object Storage
External object storage like MinIO can be very useful because you can move your workloads without moving your static data.

Replication
In the case of StatefulSets, we need to replicate the volumes somehow. This can be challenging and complex, but it's the only way to do DR.

External Observability
It's essential to maintain the visibility of your services and metrics. If observability runs in the same region, you may lose your eyes and will operate in the dark.

External Backup solution
An external backup service (SaaS) is essential to have speed, security and to be able to restore your StatefulSets to any cloud provider you want, especially if you don't have a replication solution up and running.

External Log solution
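The DNS cache and TTL issue mentioned above can be put into numbers. This is a simplified model, assuming clients honour the record TTL and that failover is triggered after a fixed number of failed health checks; the parameter names and example values are illustrative.

```python
def worst_case_dns_failover_seconds(check_interval_s: int, failures_to_trip: int, ttl_s: int) -> int:
    """Upper bound on how long a client may keep hitting the failed region."""
    detection = check_interval_s * failures_to_trip  # time to declare the primary down
    return detection + ttl_s  # plus the time cached answers stay valid


# Example: 30 s health checks, 3 failures to trip, 60 s TTL.
print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
```

Lowering the TTL shrinks this window at the cost of more DNS queries, which is one reason an external load balancer can switch faster than DNS alone.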
Statefulsets to object storage
It's important to map the StatefulSet apps and refactor them to use object storage instead. With that, we can move to another region with more flexibility, and we can also use HPA for the app.

External or Multi-Region Database
It's important to use external databases like Atlas or cloud databases with multi-region HA/replication.
replication service; it can replicate configuration and data from one region to another.

Kasten.io
It's a kubernetes backup system....

Cohesity.com
It's a kubernetes backup system....

Portworx.com
It's a kubernetes backup system....

Trilio.io
It's a kubernetes backup system....
is to have Active/Active clusters, but volume replication can be challenging, and the cost will at least double.

The cheapest option is to use backup and restore in another region, but the downtime can be long. It will affect the users and the business and be bad for the company's image.

Something in between is to have a second cluster in passive mode (warm standby) at 1/4 of the size of the main cluster. This cluster will have the same workloads and configurations, with pods stopped. You only need to restore the volumes and start your apps. The downtime will be shorter than with backup and restore, but it's still there. The cost is higher than backup and restore; it's not double, but it's something to consider.
from EBS/EFS to a cloud native storage can offer more flexibility to back up and restore volumes, but it's one more tool to manage, and the performance, with one more layer, can be lower than EBS/EFS, even with tuning and fast SSD disks.
you need a very sophisticated monitoring and metrics system. With this, you can notice bad behavior in your apps, like spikes and intermittent failures, before it becomes a problem. With that, you can move fast to another cloud provider or at least to another region in your cloud provider.

Perhaps one of the most important things is to have your monitoring and metrics system in a different place than where your apps are running, like another cloud provider or a SaaS service. This strategy is essential to avoid losing visibility of your systems, services, apps, and cloud infrastructure.
in a different cloud provider or at least in another region of your cloud provider. It's the same strategy as for the monitoring and metrics system: to remain available, it needs to be in a different cloud region from the one with problems.
use the warm-standby option if the idea is to restore systems and services with minimal downtime.

Why? It offers the best cost-benefit ratio of the three scenarios.
Benefits? The downtime will be shorter than with Backup/Restore, and the service will be stable in another region.
Cost? The cost will be lower than Active/Active, with a good response time to restore systems and services.
Complexity? It's not that complex. With terraform and GitOps we can manage it.
Learning Curve? It's small to medium for the client DevOps team.
Estimated time to achieve that? I would say 160h for the initial setup, tests, and documentation for the first phase.
you decided how much availability you want. With that settled, we can suggest a proper solution.

This investigation was merely a starting point to create a DR strategy and a robust DR plan to:
- Move to another cloud region
- Move to another cloud provider

Just keep in mind that to migrate/restore faster and with minimal downtime, the use of stateless applications is essential.