EKS DR Investigation


EKS DR Research
Let's dive into the DR options for EKS.

EKS DR: let's investigate it
Guto Carvalho, Cloud Native Engineer

Topics
- Disaster Recovery 101 (DR)
- DR Active|Active
- AWS Cross Region Replication
- DR Warm stand-by
- DR Restore from backup
- Key Principles
- Other tools
- Final notes

Introduction: Disaster Recovery

Disaster Recovery: what does it mean?

What is it? Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations.

What's the goal? The goal of DR methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs.

How do we do that? We need to create a plan that addresses everything required to restore the systems and the operations, and to minimise data loss and damage to the company's image.

EKS Disaster Recovery: the idea and requirements

AWS EKS: Our goal is to find a way to keep services online, or at least to reduce the downtime of the EKS production cluster, in case of an AWS Region outage.

Recovering: To do that, we need to find a way to restore the cluster in another region as fast as possible, using automation through infrastructure as code or another solution.

Methods that we can use
Essentially we have 3 methods.

Active-Active
In this scenario we have 2 clusters running. We deploy on both clusters, and we operate and monitor both clusters. If the primary fails, the other cluster keeps serving traffic.
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging

Warm standby
We have a smaller second cluster running, with the same workloads and configurations at a tiny scale. In case of a disaster we can manually switch the LB/DNS to the second cluster and scale up.
- Some downtime (minutes to hours)
- Expensive
- Somewhat easy to manage
- Data replication can be challenging

Restore from backup
Here we need to recreate the cluster in another region (with automation) and restore everything from the backup system.
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- Restore from backup can take time
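The trade-off between the three methods can be sketched as a small lookup: given the downtime the business can tolerate, pick the cheapest method that still fits. This is only an illustration. The hour values and the 1 to 3 cost ranking below are assumptions chosen to mirror the "no downtime", "minutes to hours" and "hours to days" wording above, not measured figures, and `DrMethod` and `cheapest_method` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrMethod:
    name: str
    worst_case_downtime_hours: float  # assumed value, for illustration only
    relative_cost: int                # 1 = cheapest, 3 = most expensive


# Ranking follows the slide; the hour values are assumptions.
METHODS = [
    DrMethod("restore-from-backup", 48.0, 1),
    DrMethod("warm-standby", 4.0, 2),
    DrMethod("active-active", 0.0, 3),
]


def cheapest_method(max_downtime_hours: float) -> DrMethod:
    """Return the cheapest method whose assumed worst-case downtime fits."""
    if max_downtime_hours < 0:
        raise ValueError("max_downtime_hours must not be negative")
    candidates = [
        m for m in METHODS if m.worst_case_downtime_hours <= max_downtime_hours
    ]
    # active-active has zero assumed downtime, so candidates is never empty.
    return min(candidates, key=lambda m: m.relative_cost)
```

For example, `cheapest_method(8).name` gives `"warm-standby"` under these assumed numbers; replacing them with real RTO targets is the actual exercise.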

Methods that we can use

Options
- Active-Active
- Warm-standby
- Restore from backup

Solutions
Each method gives us a way to restore the business, and we can use different tools to do it. Let's see some options for each one of them.

DR Active|Active: let's understand how it works

[Active|Active]

A primary and a secondary EKS cluster, with deployments going to both EKS clusters and volume replication between them:
- Same size
- Same configurations
- Same workloads
- Replica of stateful volumes (complex)

Pros/Cons
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging
- External LB is needed
- External DNS is needed

[Active|Active] Some options
- Two EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management/replication: AWS
- DNS provider: cloudflare

[Active|Active] How to reduce the cost of Active/Active?

Smaller instances: On the second active cluster we can set smaller instances with a strong autoscaling configuration. The load balancer can send 25% of the traffic to cluster B and, in case of an outage, the cluster can scale out to fit the user needs.

Analysers: With SaaS analysers we can find and adjust pods and nodes, reducing cloud costs.

Instances: Spot.io can facilitate the use of spot instances to reduce instance costs.

Tools
- cast.ai https://cast.ai/cost-optimizer/
- spot.io
- harness.io https://harness.io/products/cloud-cost/
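To get a feel for the 25% traffic split, here is a rough sizing sketch. It assumes node count scales linearly with traffic and that both clusters use the same node type, which real workloads rarely do exactly; `failover_scale_out` and its parameters are hypothetical names.

```python
import math


def failover_scale_out(full_load_nodes: int, standby_traffic_share: float) -> tuple[int, int]:
    """Return (nodes cluster B runs normally, extra nodes B adds on failover).

    full_load_nodes: nodes needed to serve 100% of the traffic (assumption:
    capacity is proportional to traffic).
    standby_traffic_share: fraction of traffic cluster B serves normally.
    """
    if full_load_nodes <= 0:
        raise ValueError("full_load_nodes must be positive")
    if not 0 < standby_traffic_share <= 1:
        raise ValueError("standby_traffic_share must be in (0, 1]")
    steady_state_nodes = math.ceil(full_load_nodes * standby_traffic_share)
    extra_nodes_on_failover = full_load_nodes - steady_state_nodes
    return steady_state_nodes, extra_nodes_on_failover
```

With 40 nodes needed for full load and a 25% share, cluster B would run 10 nodes day to day and need 30 more during an outage, so the autoscaler limits and instance quotas in the second region have to allow for that jump.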

[Active|Active] A multi-region EKS cluster can be complex and expensive

Services and features involved:
- AWS Backup for EFS
- S3 Replication
- EBS Snapshot
- RDS Snapshot
- Cross-region backup
- Multi-AZ
- IAM
- EC2
- EKS
- VPC

AWS Backup & Cross Region Features: let's understand how it works

RDS: AWS cross-region replication

Cross-region features: We can create RDS cross-region read replicas for unencrypted MySQL and PostgreSQL database instances.

- Disaster recovery: You can create cross-region read replicas of your primary database instance to have a disaster recovery solution. If your primary region faces a disruption, you can promote the replica to a master and keep your business operational.
- Scaling: You can use cross-region read replicas to support read queries from your workloads across various geographic locations. This will reduce latency by serving your customers from a database that is close to them.
- Cross-region migration: If you would like to migrate your database instance quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region and, once it is ready, promote it to master and point your application to it.
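The create-then-promote flow described above can be sketched with boto3's RDS client. This is a minimal sketch, not a production runbook: the client is passed in so the functions can be exercised without AWS access, the identifiers and regions are placeholders, and waiting for the replica to become available before promotion is left out.

```python
def create_cross_region_replica(rds_dr_client, source_db_arn, source_region, replica_id):
    """Create a read replica in the DR region from a primary in another region.

    rds_dr_client is expected to be a boto3 RDS client created in the DR
    region, e.g. boto3.client("rds", region_name="eu-west-1") (placeholder).
    For cross-region replicas the source must be given as an ARN.
    """
    response = rds_dr_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_db_arn,
        SourceRegion=source_region,
    )
    return response["DBInstance"]["DBInstanceIdentifier"]


def promote_replica(rds_dr_client, replica_id):
    """Promote the replica to a standalone instance during a disaster."""
    response = rds_dr_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    return response["DBInstance"]["DBInstanceIdentifier"]
```

Promotion is one-way: once promoted, the instance stops replicating from the old primary, so failing back means creating a new replica in the other direction.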

S3: AWS cross-region replication

Cross-region features: We can create an S3 cross-region replica; every object uploaded to the origin S3 bucket will be replicated to the destination bucket.

- Disaster recovery: You can create cross-region replicas of your primary bucket to have a disaster recovery solution. If your primary region faces a disruption, you can point your app to the replica bucket and keep your business operational.
- Lower latency: You can use cross-region replication to provide lower-latency data access in different geographic regions.
- Cross-region migration: If you would like to migrate your bucket quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region and, once it is ready, point your application to it.
- No extra costs: You pay Amazon S3's usual charges for storage, requests, and inter-region data transfer for the replicated copy of data.

Cross-region replication is available in the US Standard, US West (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (Sao Paulo) regions.

EBS Snapshot: AWS cross-region copy

Cross-region features: We can create an EBS snapshot copy across AWS regions.

- Disaster recovery: You can create cross-region copies of your EBS snapshots to have a disaster recovery solution. If your primary region faces a disruption, you can restore the copied snapshot in another region, mount the volume in your cluster and point your app to the new PV.

We need to automate the copies, restores and remounts somehow with IaC. AWS only offers the copy, not a replica, so it's pretty much a manual procedure.
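Automating the copy step could look roughly like the sketch below, using boto3's EC2 `copy_snapshot` call. The call is made against a client in the destination region and pulls the snapshot from the source region. The snapshot ID and regions are placeholders, and scheduling, restore and remount are deliberately left out.

```python
def copy_snapshot_to_dr_region(ec2_dr_client, source_region, snapshot_id):
    """Copy one EBS snapshot into the region the client was created in.

    ec2_dr_client is expected to be a boto3 EC2 client for the DR region,
    e.g. boto3.client("ec2", region_name="eu-west-1") (placeholder region).
    Returns the ID of the new snapshot in the DR region.
    """
    response = ec2_dr_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

The copy runs asynchronously on the AWS side, so a real job would also poll the new snapshot until it is completed before counting it as a usable restore point.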

Cross Region Backup: AWS Backup

We can create backups of several services in another region.

It offers scheduling, policy-based backup, tag-based backup, retention management, backup audit, lifecycle policies, data encryption, cross-account backup, access policies and even item-level recovery for EFS.

It's a fantastic solution, but we need to adjust and configure everything in the backup region to get things running like in the original region. We'll need heavy automation to have everything operational again.

Services covered
- EC2 Instances
- EBS
- RDS
- DynamoDB
- EFS
- Storage Gateway
- Neptune
- DocumentDB
- S3 (Preview)

Cross Region synchronisation: AWS DataSync

AWS DataSync is a cloud migration tool, often used to migrate data from on-premises to AWS, but you can also use it to sync data between AWS storage services. DataSync offers another way to sync S3 and EFS between regions; it's simple and low cost. With AWS DataSync you can easily replicate, archive, or share application data.

Services covered
- EFS
- S3
- FSx for Windows File Server
- FSx for Lustre

Price
- 1 TB per month: 1,000 GB x 0.0125 USD = 12.50 USD
- Total: 12.50 USD
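The price line above is a simple per-GB multiplication, sketched here so other volumes can be plugged in. The default rate is the 0.0125 USD per GB used on the slide; treat it as an assumption and check current pricing for your region before relying on it.

```python
def datasync_transfer_cost_usd(gigabytes: float, usd_per_gb: float = 0.0125) -> float:
    """Estimate the DataSync charge for copying `gigabytes` of data."""
    if gigabytes < 0 or usd_per_gb < 0:
        raise ValueError("gigabytes and usd_per_gb must not be negative")
    return round(gigabytes * usd_per_gb, 2)
```

`datasync_transfer_cost_usd(1000)` reproduces the 12.50 USD figure. Inter-region data transfer and storage in the destination are billed separately and are not included here.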

Cross Region Replica and Copy: AWS Calculator

We are still trying to find a way to estimate this cost; for now the AWS calculator doesn't offer that type of estimation.

Warm stand-by: let's understand how it works

Warm Stand-by (Option 2)

A primary and a secondary EKS cluster, with deployments going to both EKS clusters; the K8S backup restores the volumes on the secondary:
- Different sizes
- Same configurations
- Same workloads (replica 0)
- Restore only volumes from backup

Pros/Cons
- Some downtime
- Expensive
- Somewhat easy to manage
- Restore can take time
- External LB is needed
- External DNS is needed
- Need to scale up after the switch

Warm Stand-by: some options
- EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management / cross-region replica: AWS
- Backup/Restore: velero
- DNS provider: cloudflare

Restore from backup: let's understand how it works

Restore from backup (Option 1): new cluster

- Need to create a new cluster
- Restore K8S backups
  - Workloads backups
  - Volumes backups

Flow: the cluster provisioner restores the EKS cluster, then the K8S workloads backup and the K8S volumes backup are restored into it.

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- No need to reinstall everything
- Restore from backup can take time
- Need an external LB
- Need an external DNS

Restore from backup (Option 2): new cluster

- Need to create a new cluster
- GitOps will recreate the configuration
- Need to restore the AWS backup

Flow: the cluster provisioner restores the EKS cluster, the apps provisioner restores the apps, and the storage is restored from the AWS backup.

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- Need to replicate data across regions
- Restore from AWS backup can take time
- Need to reinstall everything
- Need an external LB
- Need an external DNS

Restore from backup (Option 1): some options
- EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management/backup: longhorn
- Backup/Restore: velero
- DNS provider: cloudflare

Longhorn: If the primary cluster is using longhorn instead of EBS/EFS for volumes, it's easy to restore statefulset volumes on the new cluster.

Helm: Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.

Restore from backup (Option 2): some options
- EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management/sync/backup: AWS Backup, AWS DataSync
- DNS provider: cloudflare

Helm: Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.

Restore from backup: how to reduce the downtime of Backup/Restore?

- Automation: Using automation is the only way to reduce the downtime; it's an essential and vital part of the DR plan.
- Restore plan: If the restore plan is important, the order in which things are restored is its most important part. We need to know the fastest and most secure way to have everything up and running in the right order.
- Restore test: It's essential to test the disaster recovery plan to ensure that it is still viable and that it covers everything we need. During the test, we can enhance the procedure documentation and make the necessary adjustments, always following a continuous improvement strategy.
- SaaS backup: Using an external SaaS backup can help with the backups and sometimes speed up the restore.
- SaaS DR: Some companies offer the entire DR as a service.
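The point about restore order can be made concrete by writing the dependencies down and letting a topological sort produce the sequence. The component names and their dependencies below are hypothetical examples, not the actual plan; the sketch uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Each key lists what must already be restored before it can start.
# These components and dependencies are illustrative assumptions.
RESTORE_DEPENDENCIES = {
    "cluster": set(),
    "storage-volumes": {"cluster"},
    "databases": {"storage-volumes"},
    "applications": {"databases"},
    "dns-switch": {"applications"},
}


def restore_order(dependencies):
    """Return the components in an order that respects every dependency.

    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Keeping the plan as data has a side benefit: a circular dependency shows up as an error when the plan is tested, not in the middle of a real restore.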

Key principles: things that we need to do to have a good DR strategy

Key principles for a good DR strategy

- External DNS: An external DNS can help to switch to another cloud provider or region fast, but it can have issues with DNS zone caching and TTL.
- External Load Balancer: An external load balancer can help to switch to another cloud provider or region fast and, unlike DNS, there is no gap on the client side.
- External database: An external database like Atlas can be very useful because you can move your workloads without moving your data or DBMS.
- External Object Storage: External object storage like MinIO can be very useful because you can move your workloads without moving your static data.
- Replication: In the case of statefulsets, we need to replicate the volumes somehow. This can be challenging and complex, but it's the only way to do DR.
- External Observability: It's essential to maintain the visibility of your services and metrics. If observability runs in the same region, you may lose your eyes and operate in the dark.
- External Backup solution: An external backup service (SaaS) is essential to have speed, security and to be able to restore your statefulsets to any cloud provider you want, especially if you don't have a replication solution up and running.
- External Log solution

Key principles for a good DR strategy (continued)

- Statefulsets to object storage: It's important to map the statefulset apps and refactor them to use "Object Storage" instead. With that, we can move to another region with more flexibility and we can also use HPA for the app.
- External or Multi-Region Database: It's important to use external databases like "Atlas" or cloud databases with Multi-Region HA / Replication.

Other tools: same goals, closed tools

Other tools: closed solutions (SaaS)

- Arpio.io: It's an AWS replication service; it can replicate configuration and data from one region to another.
- Kasten.io: It's a kubernetes backup system....
- Cohesity.com: It's a kubernetes backup system....
- Portworx.com: It's a kubernetes backup system....
- Trilio.io: It's a kubernetes backup system....

Other tools: read more...

- Arpio.io https://arpio.io/how-it-works/
- Kasten.io https://www.kasten.io/kubernetes/use-cases/disaster-recovery/
- Cohesity.com https://www.cohesity.com/products/sitecontinuity-for-disaster-recovery-as-a-service/
- Portworx.com https://portworx.com/products/px-backup/
- Trilio.io https://trilio.io/triliovault-for-kubernetes/

EKS DR: Final notes

Final notes: EKS Disaster Recovery Investigation

The ideal option is to have Active/Active clusters, but the volume replication can be challenging, and the cost will at least double.

The cheapest option is to use backup and restore in another region, but the downtime can be long. It will affect the users and the business and be bad for the company's image.

Something in between is to have the second cluster in passive mode (warm standby) with 1/4 of the size of the main cluster. This cluster will have the same workloads and configurations, with pods stopped. You only need to restore the volumes and start your apps. The downtime will be shorter than with backup and restore, but it's still there. The cost is higher than backup and restore; it's not double, but it's something to consider.
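The cost comparison in these notes can be put into rough numbers. The sketch below looks at compute only and assumes cost scales linearly with cluster size; storage, replication, data transfer and backup charges are ignored, so it is a lower bound for illustration. `relative_compute_cost` is a hypothetical helper.

```python
def relative_compute_cost(standby_size_fraction: float) -> float:
    """Compute cost relative to running only the primary cluster (1.0).

    standby_size_fraction: size of the second cluster as a fraction of the
    primary (1.0 for Active/Active, 0.25 for the warm standby above,
    0.0 for restore from backup, where no second cluster runs).
    """
    if not 0 <= standby_size_fraction <= 1:
        raise ValueError("standby_size_fraction must be between 0 and 1")
    return 1.0 + standby_size_fraction
```

Under these assumptions Active/Active comes out at 2.0 times the primary's compute cost, a 1/4-size warm standby at 1.25 times, and restore from backup at 1.0 times, which matches the ordering described in the notes.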

Final notes: AWS Backup and Region Replication

It's complex, even with automation. It's expensive, it's not that easy to maintain, manage and operate, and the cost could be unviable compared with the benefits.

Final notes: use of a Cloud Native Storage

Moving from EBS/EFS to a Cloud Native Storage can offer more flexibility to back up and restore volumes, but it's one more tool to manage, and the performance, with one more layer, can be lower than EBS/EFS, even with tuning and fast SSD disks.

Final notes: Active Monitoring and Metrics

To prevent downtime, you need a very sophisticated monitoring and metrics system. With it, you can notice bad behavior in your apps, like spikes and intermittent failures, before it becomes a problem. With that, you can move fast to another cloud provider, or at least to another region in your cloud provider.

Perhaps one of the most important things is to have your monitoring and metrics system in a different place than where your apps are running, like another cloud provider or a SaaS service. This strategy is essential to avoid losing the visibility of your systems, services, apps, and cloud infrastructure.

Final notes: Backup

We have to keep the backups in a different cloud provider, or at least in another region of the same cloud provider. It's the same strategy as for the monitoring and metrics system: to be available, the backups need to be in a region other than the one with problems.

Final notes: our suggestion

In our opinion, you should use the warm-standby option if the idea is to restore systems and services with minimal downtime.

- Why? It has the best cost vs. benefit of the three scenarios.
- Benefits? The downtime will be shorter than with Backup/Restore, and the service will be stable in another region.
- Cost? The cost will be lower than Active/Active, with a good response time to restore systems and services.
- Complexity? It's not that complex. With terraform and GitOps we can manage it.
- Learning curve? It's small to medium for the client DevOps team.
- Estimated time to achieve that? I would say 160h for the initial setup, tests, and documentation for the first phase.

Final notes: too much information?

It would help if you decided how much availability you want. With that settled, we can suggest a proper solution.

This investigation was merely a starting point to create a DR strategy and a robust DR plan to:
- Move to another cloud region
- Move to another cloud provider

Just keep in mind that to migrate/restore faster and with minimal downtime, the use of stateless applications is essential.

Photos: Unsplash, Kubecon

Thank You
