Slide 1

Slide 1 text

EKS DR Investigation

Slide 2

Slide 2 text

EKS DR Research
Let's dive into the DR options for EKS

Slide 3

Slide 3 text

Flato Presentation
EKS DR: Let's investigate it
Guto Carvalho, Cloud Native Engineer

- Disaster Recovery 101 (DR)
- DR Active|Active
- AWS Cross Region Replication
- DR Warm stand-by
- DR Restore from backup
- Key Principles
- Other tools
- Final notes

Slide 4

Slide 4 text

Introduction Disaster Recovery

Slide 5

Slide 5 text

Disaster Recovery: what does it mean?

What is it?
Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations.

What's the goal?
The goal of DR methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs.

How do we do that?
We need to create a plan that addresses everything required to restore the systems, restore operations, and minimise data loss and damage to the company's image.

Slide 6

Slide 6 text

EKS Disaster Recovery: the idea and requirements

AWS EKS
Our goal is to find a way to keep services online, or at least reduce the downtime of the EKS production cluster, in case of an AWS Region outage.

Recovering
To do that, we need to find a way to restore the cluster in another region as fast as possible, using automation through infrastructure as code or another solution.

Slide 7

Slide 7 text

Methods that we can use: essentially we have 3 methods

Active-Active
In this scenario we have 2 clusters running. We deploy on both clusters, and we operate and monitor both. If the primary fails, the other cluster keeps serving traffic.
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging

Warm standby
We have a smaller second cluster running, with the same workloads and configurations at a tiny scale. In case of a disaster we can manually switch the LB/DNS to the second cluster and scale up.
- Some downtime (minutes to hours)
- Expensive
- Somewhat easy to manage
- Data replication can be challenging

Restore from backup
Here we need to recreate the cluster in another region (with automation) and restore everything from the backup system.
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- Restore from backup can take time
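
The trade-off between the three methods can be sketched as a small lookup: pick the cheapest method whose worst-case downtime still fits the downtime the business can accept. This is only an illustration. The cost ranking follows the slide, but the downtime figures (0, 4 and 72 hours) and all names are assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_downtime_hours: float  # assumed placeholder, not a measurement
    cost_rank: int  # 1 = cheapest, 3 = most expensive (ordering from the slide)


# Only the ordering comes from the slide; the hour values are assumptions.
STRATEGIES = [
    Strategy("active-active", 0.0, 3),
    Strategy("warm-standby", 4.0, 2),
    Strategy("restore-from-backup", 72.0, 1),
]


def cheapest_strategy(max_downtime_hours: float) -> Optional[Strategy]:
    """Return the cheapest strategy whose worst-case downtime fits the budget."""
    candidates = [
        s for s in STRATEGIES
        if s.worst_case_downtime_hours <= max_downtime_hours
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)
```

With these assumed figures, a budget of 8 hours selects warm standby, while a budget of zero leaves only Active-Active.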

Slide 8

Slide 8 text

Methods that we can use: EKS DR Investigation

Options
> Active-Active
> Warm-standby
> Restore from backup

Solutions
For each method, we have a way to restore the business. We can use different tools to do it. Let's see some options for each one of them.

Slide 9

Slide 9 text

DR Active|Active Let's understand how it works

Slide 10

Slide 10 text

[Active|Active] EKS DR Investigation

Diagram: Deploy targets two EKS clusters, Primary and Secondary, with Volume Replication between them.

EKS Clusters
- Same size
- Same configurations
- Same workloads
- Replica of stateful volumes (complex)

Pros/Cons
- No downtime
- Very expensive
- Easy to manage
- Data replication can be challenging
- External LB is needed
- External DNS is needed

Slide 11

Slide 11 text

Some options [Active|Active]

- Two EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management/replication: AWS
- DNS provider: cloudflare

Slide 12

Slide 12 text

How to reduce the cost of Active/Active? DR [Active|Active]

Smaller instances
On the second active cluster we can use smaller instances with a strong autoscaling configuration. The load balancer can send 25% of the traffic to cluster B and, in case of an outage, the cluster can scale out to fit the user needs.

Instances
Spot.io can facilitate the use of spot instances to reduce instance costs.

Analysers
With SaaS analysers we can find and adjust pods and nodes, reducing cloud costs.

Tools
- cast.ai: https://cast.ai/cost-optimizer/
- spot.io
- harness.io: https://harness.io/products/cloud-cost/
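
As a rough worked example of the 25% traffic split: if cluster B normally serves a quarter of the traffic, it has to grow by about a factor of four to absorb everything during an outage. The sketch below assumes capacity needs scale linearly with traffic, which is a simplification; the function names are illustrative.

```python
import math


def scale_out_factor(standby_traffic_share: float) -> float:
    """How many times the standby must grow to take 100% of the traffic."""
    if not 0 < standby_traffic_share <= 1:
        raise ValueError("traffic share must be in (0, 1]")
    return 1 / standby_traffic_share


def nodes_after_failover(current_nodes: int, standby_traffic_share: float) -> int:
    """Node count needed after failover, assuming linear scaling with traffic."""
    return math.ceil(current_nodes * scale_out_factor(standby_traffic_share))
```

For instance, a cluster B running 3 nodes at a 25% share would need around 12 nodes after failover, so the autoscaler limits and instance quotas in the second region should allow for that.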

Slide 13

Slide 13 text

Multi Region EKS Cluster can be complex and expensive [Active|Active]

- AWS Backup for EFS
- S3 Replication
- EBS Snapshot
- RDS Snapshot
- Cross-region-backup
- Multi-AZ
- IAM
- EC2
- EKS
- VPC

Slide 14

Slide 14 text

AWS Backup & Cross Region Features Let's understand how it works

Slide 15

Slide 15 text

RDS: AWS Cross region replication

Cross-region Features
We can create RDS cross-region read replicas for unencrypted MySQL and PostgreSQL database instances.

Disaster Recovery: You can create cross-region read replicas of your primary database instance to have a disaster recovery solution. If your primary region faces a disruption, you can promote the replica to master and keep your business operational.

Scaling: You can use cross-region read replicas to support read queries from your workloads across various geographic locations. This reduces latency by serving your customers from a database that is close to them.

Cross-region Migration: If you would like to migrate your database instance quickly to another AWS region, you can do so by using cross-region replication. Simply create a replica in your target region and, once it is ready, promote it to master and point your application to it.
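
The create-then-promote flow described for RDS could be scripted roughly as follows. This is a sketch, not a tested runbook: `rds_client` is assumed to be a boto3 RDS client created in the DR region (for example `boto3.client("rds", region_name=...)`), and the identifiers are placeholders. For a cross-region replica the source must be given as an ARN.

```python
def create_cross_region_replica(rds_client, replica_id: str, source_arn: str) -> str:
    """Create a read replica in the client's region from a source instance ARN."""
    response = rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,  # ARN is required across regions
    )
    return response["DBInstance"]["DBInstanceIdentifier"]


def promote_replica(rds_client, replica_id: str) -> str:
    """Promote the replica to a standalone instance and return its status."""
    response = rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    return response["DBInstance"]["DBInstanceStatus"]
```

Promotion breaks replication permanently, so the application connection string still has to be switched to the promoted instance as a separate step.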

Slide 16

Slide 16 text

S3: AWS Cross region replication

Cross-region Features
We can create an S3 cross-region replica: every object uploaded to the origin S3 bucket will be replicated to the destination bucket.

Disaster Recovery: You can create cross-region replicas of your primary bucket to have a disaster recovery solution. If your primary region faces a disruption, you can point your app to the replica bucket and keep your business operational.

Lower latency: You can use cross-region replication to provide lower-latency data access in different geographic regions.

Cross-region Migration: If you would like to migrate your bucket quickly to another AWS region, you can do so by using cross-region replication. Simply create a replica in your target region and, once it is ready, point your application to it.

No extra costs: You pay Amazon S3's usual charges for storage, requests, and inter-region data transfer for the replicated copy of data.

Cross-region replication is available in the US Standard, US West (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (Sao Paulo) regions.
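
A replication rule for the whole bucket might be set up along these lines. This is a minimal sketch: `s3_client` is assumed to be a boto3 S3 client, the rule ID, role ARN and bucket names are placeholders, and both buckets must already have versioning enabled for replication to work.

```python
def replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a replication configuration that copies every object."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-replicate-all",  # illustrative rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(s3_client, source_bucket: str, role_arn: str,
                       destination_bucket: str) -> None:
    """Attach the replication configuration to the source bucket."""
    s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_config(role_arn, destination_bucket),
    )
```

Replication only applies to objects written after the rule is in place; existing objects need a separate copy step.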

Slide 17

Slide 17 text

EBS Snapshot: AWS Cross Region Copy

Cross-region Features
We can create an EBS snapshot copy across AWS regions.

Disaster Recovery: You can create cross-region copies of your EBS snapshots to have a disaster recovery solution. If your primary region faces a disruption, you can restore the snapshot copy in another region, mount it in your cluster and point your app to the new PV.

We need to automate the copies, restores and remounts somehow with IaC. AWS only offers the copy, not a replica, so it's pretty much a manual procedure.
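
The copy step itself is straightforward to automate. The sketch below assumes `dest_ec2_client` is a boto3 EC2 client created in the destination (DR) region, since the copy is requested from the region that receives it; the function name and description text are illustrative.

```python
def copy_snapshot_to_dr(dest_ec2_client, source_region: str, snapshot_id: str) -> str:
    """Request a cross-region copy and return the new snapshot's ID."""
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

This covers only the copy. Creating a volume from the copied snapshot and binding it to a PV in the new cluster would still need their own automation.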

Slide 18

Slide 18 text

Cross Region Backup: AWS Backup

We can create backups of several services in another region.
AWS Backup offers scheduling, policy-based backup, tag-based backup, retention management, backup audit, lifecycle policies, data encryption, cross-account backup, access policies and even item-level recovery for EFS.

It's a fantastic solution, but we need to adjust and configure everything in the backup region to get things running like the original region. We'll need heavy automation to have everything operational again.

Services covered
- EC2 Instances
- EBS
- RDS
- DynamoDB
- EFS
- Storage Gateway
- Neptune
- DocumentDB
- S3 (Preview)

Slide 19

Slide 19 text

Cross Region synchronisation: AWS DataSync

AWS DataSync is a cloud migration tool, often used to migrate data from on-premises to AWS, but you can also use it to sync data between AWS storage services. DataSync offers another way to sync S3 and EFS between regions; it's simple and low cost. With AWS DataSync you can easily replicate, archive, or share application data.

Services covered
- EFS
- S3
- FSx for Windows File Server
- FSx for Lustre

Price
1 TB per month
1,000 GB x 0.0125 USD = 12.50 USD

Total: 12.50 USD
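
The price arithmetic above can be reproduced for other volumes with a few lines. The per-GB rate is the one quoted on the slide and should be checked against current pricing; the function name is illustrative.

```python
from decimal import Decimal

# Rate quoted on the slide; verify against current AWS pricing before use.
PRICE_PER_GB_USD = Decimal("0.0125")


def datasync_transfer_cost(gigabytes: int,
                           price_per_gb: Decimal = PRICE_PER_GB_USD) -> Decimal:
    """Cost in USD of copying the given number of gigabytes."""
    return (Decimal(gigabytes) * price_per_gb).quantize(Decimal("0.01"))


print(datasync_transfer_cost(1000))  # 12.50
```

`Decimal` is used instead of floats so that money amounts round predictably to cents.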

Slide 20

Slide 20 text

Cross Region Replica and Copy: AWS Calculator

We are still trying to find a way to calculate that; for now, the AWS calculator doesn't offer that type of estimation.

Slide 21

Slide 21 text

Warm stand-by Let's understand how it works

Slide 22

Slide 22 text

Warm Stand-by: EKS DR Investigation (Option 2)

Diagram: Deploy targets two EKS clusters, Primary and Secondary; the K8S backup restores volumes on the Secondary.

EKS Clusters
- Different sizes
- Same configurations
- Same workloads (replica 0)
- Restore only volumes from backup

Pros/Cons
- Some downtime
- Expensive
- Somewhat easy to manage
- Restore can take time
- External LB is needed
- External DNS is needed
- Need to scale UP after switch

Slide 23

Slide 23 text

Some options: Warm Stand-by

- Two EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management / cross-region-replica: AWS
- Backup/Restore: velero
- DNS provider: cloudflare

Slide 24

Slide 24 text

Restore from backup Let's understand how it works

Slide 25

Slide 25 text

Restore from backup: EKS DR Investigation (Option 1)

Diagram: a cluster provisioner creates the new EKS cluster; the K8S workloads backup and the K8S volumes backup are then restored into it.

New cluster
- Need to create a new cluster
- Restore K8S backups
  - Workloads backups
  - Volumes backups

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- No need to replicate data
- No need to reinstall everything
- Restore from backup can take time
- Need an external LB
- Need an external DNS

Slide 26

Slide 26 text

Restore from backup: EKS DR Investigation (Option 2)

Diagram: a cluster provisioner creates the new EKS cluster; an apps provisioner restores the applications, and storage is restored from AWS Backup.

New cluster
- Need to create a new cluster
- GitOps will recreate the configuration
- Need to restore the AWS backup

Pros/Cons
- Downtime (could be hours to days)
- Not that expensive
- Hard to do and manage due to uncertainties
- Need to replicate data across regions
- Restore from AWS backup can take time
- Need to reinstall everything
- Need an external LB
- Need an external DNS

Slide 27

Slide 27 text

Some options 27 Two EKS Clusters in di ff erent regions terraform argocd 
 or fl ux longhorn cloud fl are Deployment in both clusters Volume management/backup DNS Provider Restore from backup Backup/Restore velero If the primary cluster is using longhorn instead of EBS/EFS for volumes, it's easy to restore statefulset volumes on the new cluster Longhorn Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between di ff erent environments Helm helm longhorn Option 1

Slide 28

Slide 28 text

Some options: Restore from backup (Option 2)

- Two EKS clusters in different regions: terraform
- Deployment in both clusters: argocd or flux, helm
- Volume management/sync/backup: AWS Backup, AWS DataSync
- DNS provider: cloudflare

Helm
Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.

Slide 29

Slide 29 text

How to reduce the downtime of Backup/Restore? Restore from backup

Automation
Using automation is the only way to reduce the downtime. It's an essential and vital part of the DR plan.

Restore Plan
The restore plan is important, and the order in which things are restored is its most important part. We need to find the fastest and most secure way to have everything up and running in the right order.

Restore Test
It's essential to test the disaster recovery plan to ensure that it is still viable and that it covers everything we need. During the test, we can enhance the procedure documentation and make the necessary adjustments, always following a continuous improvement strategy.

SaaS Backup
Using an external SaaS backup can help with backups and can sometimes speed up the restore.

SaaS DR
Some companies offer the entire DR as a service.
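
One way to make the restore order explicit and automatable is to record which step depends on which and let a topological sort produce a valid sequence. The steps and dependencies below are hypothetical examples, not the plan from this investigation; `graphlib` ships with Python 3.9+.

```python
from graphlib import TopologicalSorter

# Hypothetical restore steps, each mapped to the steps it depends on.
RESTORE_DEPENDENCIES = {
    "provision-cluster": set(),
    "install-gitops": {"provision-cluster"},
    "restore-volumes": {"provision-cluster"},
    "deploy-workloads": {"install-gitops", "restore-volumes"},
    "switch-dns": {"deploy-workloads"},
}


def restore_order(dependencies: dict) -> list:
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


for step in restore_order(RESTORE_DEPENDENCIES):
    print(step)
```

Keeping the dependencies as data means a restore test can check the plan for cycles or missing steps before anyone has to run it during an incident.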

Slide 30

Slide 30 text

Things that we need to do to have a good DR strategy Key principles

Slide 31

Slide 31 text

Key principles for a good DR strategy

External DNS
An external DNS can help to switch to another cloud provider or region fast, but it can have issues with DNS zone caching and TTL.

External Load Balancer
An external load balancer can help to switch to another cloud provider or region fast. Unlike DNS, there is no gap on the client side.

External database
An external database like Atlas can be very useful because you can move your workloads without moving your data or DBMS.

External Object Storage
External object storage like Minio can be very useful because you can move your workloads without moving your static data.

Replication
In the case of statefulsets, we need to replicate the volumes somehow. This can be challenging and complex, but it's the only way to do a DR.

External Observability
It's essential to maintain the visibility of your services and metrics. If observability runs in the same region as the workloads, you may lose visibility and operate in the dark.

External Backup solution
An external backup service (SaaS) is essential to have speed and security, and to be able to restore your statefulsets to any cloud provider you want, especially if you don't have a replication solution up and running.

External Log solution
Logs should also be shipped to an external service so that they remain available if the primary region is down.
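
The DNS caveat can be put into numbers. As a rough model, a client may keep using the old address until the failure is detected, the record is changed, and the cached record expires. The sketch below assumes resolvers honour the TTL, which is not always the case, and all names and values are illustrative.

```python
def worst_case_dns_gap_seconds(detection_seconds: int, record_ttl_seconds: int) -> int:
    """Upper bound on how long clients may still hit the failed region."""
    return detection_seconds + record_ttl_seconds


def max_ttl_for_budget(gap_budget_seconds: int, detection_seconds: int) -> int:
    """Largest TTL that keeps the worst-case gap within the budget (0 if none)."""
    return max(gap_budget_seconds - detection_seconds, 0)
```

With 60 seconds to detect an outage and a 300 second TTL, the worst-case gap under this model is 360 seconds, which is why DNS-based failover usually goes together with short TTLs on the records that need to move.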

Slide 32

Slide 32 text

Key principles for a good DR strategy

Statefulsets to object storage
It's important to map the statefulset apps and refactor them to use "Object Storage" instead. With that, we can move to another region with more flexibility and we can also use HPA for the app.

External or Multi-Region Database
It's important to use external databases like "Atlas" or cloud databases with Multi-Region HA / Replication.

Slide 33

Slide 33 text

Same goals, closed tools Other tools

Slide 34

Slide 34 text

Other tools: Closed Solutions (SaaS)

Arpio.io
It's an AWS replication service; it can replicate configuration and data from one region to another.

Kasten.io
It's a kubernetes backup system....

Cohesity.com
It's a kubernetes backup system....

Portworx.com
It's a kubernetes backup system....

Trilio.io
It's a kubernetes backup system....

Slide 35

Slide 35 text

Other tools: Read more...

- Arpio.io: https://arpio.io/how-it-works/
- Kasten.io: https://www.kasten.io/kubernetes/use-cases/disaster-recovery/
- Cohesity.com: https://www.cohesity.com/products/sitecontinuity-for-disaster-recovery-as-a-service/
- Portworx.com: https://portworx.com/products/px-backup/
- Trilio.io: https://trilio.io/triliovault-for-kubernetes/

Slide 36

Slide 36 text

EKS DR Final notes

Slide 37

Slide 37 text

Final notes: EKS Disaster Recovery Investigation

The ideal option is to have Active/Active clusters, but the volume replication can be challenging, and the cost will at least double.

The cheapest option is to use backup and restore in another region, but the downtime can be long. It will affect the users and the business, and be bad for the company's image.

Something in between is to have the second cluster in passive mode (warm standby) with 1/4 of the size of the main cluster. This cluster will have the same workloads and configurations, with the pods stopped. You only need to restore the volumes and start your apps. The downtime will be shorter than with backup and restore, but it's still there. The cost is higher than backup and restore; it's not double, but it's something to consider.
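
The cost comparison can be made concrete with a back-of-the-envelope calculation. The sketch assumes compute cost scales linearly with cluster size and ignores data replication, inter-region transfer and backup storage, so it is a lower bound rather than an estimate; the function name is illustrative.

```python
def relative_compute_cost(second_cluster_size_fraction: float) -> float:
    """Total compute cost relative to running only the primary cluster."""
    if second_cluster_size_fraction < 0:
        raise ValueError("size fraction cannot be negative")
    return 1.0 + second_cluster_size_fraction


print(relative_compute_cost(1.0))   # Active/Active, same-size second cluster: 2.0
print(relative_compute_cost(0.25))  # warm standby at 1/4 of the size: 1.25
```

Under these assumptions the warm standby adds roughly 25% to the compute bill, against at least 100% for Active/Active.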

Slide 38

Slide 38 text

Final notes: AWS Backup and Region Replication

It's complex, even with automation. It's expensive, it's not that easy to maintain, manage and operate, and the cost could be unviable compared with the benefits.

Slide 39

Slide 39 text

Final notes: Use of a Cloud Native Storage

Moving from EBS/EFS to a cloud native storage can offer more flexibility to back up and restore volumes, but it's one more tool to manage, and the performance, with one more layer, can be lower than EBS/EFS, even with tuning and fast SSD disks.

Slide 40

Slide 40 text

Final notes: Active Monitoring and Metrics

To prevent downtime, you need a very sophisticated monitoring and metrics system. With it, you can notice bad behavior in your apps, like spikes and intermittent failures, before it becomes a problem. With that, you can move fast to another cloud provider or at least to another region in your cloud provider.

Perhaps one of the most important things is to have your monitoring and metrics system in a different place from where your apps are running, like another cloud provider or a SaaS service. This strategy is essential to avoid losing the visibility of your systems, services, apps, and cloud infrastructure.

Slide 41

Slide 41 text

Final notes: Backup

We have to keep the backups in a different cloud provider or at least in another region of our cloud provider. It's the same strategy as for the monitoring and metrics system: to be available, the backups need to be in a different region from the one with problems.

Slide 42

Slide 42 text

Final notes: Our suggestion

In our opinion, you should use the warm-standby option if the idea is to restore systems and services with minimal downtime.

Why? It's the best cost vs. benefit of the three scenarios.

Benefits? The downtime will be shorter than with Backup/Restore, and the service will be stable in another region.

Cost? The cost will be lower than Active/Active, with a good response time to restore systems and services.

Complexity? It's not that complex. With terraform and GitOps we can manage it.

Learning Curve? It's small to medium for the client DevOps team.

Estimated time to achieve that? I would say 160h for the initial setup, tests, and documentation for the first phase.

Slide 43

Slide 43 text

Final notes: Too much information?

It would help if you decided how much availability you want. With that settled, we can suggest a proper solution.

This investigation was merely a starting point to create a DR strategy and a robust DR plan to:
- Move to another cloud region
- Move to another cloud provider

Just keep in mind that to migrate/restore faster and with minimal downtime, the use of stateless applications is essential.

Slide 44

Slide 44 text

Photos: Unsplash, Kubecon

Slide 45

Slide 45 text

Thank You