EKS DR Investigation

Guto Carvalho
February 09, 2022

Transcript

  1. EKS DR: Let's investigate it

    Guto Carvalho, Cloud Native Engineer

    Topics: Disaster Recovery 101 (DR); DR Active|Active; AWS Cross Region Replication; DR Warm stand-by; DR Restore from backup; Key Principles; Other tools; Final notes.
  2. Disaster Recovery: What does it mean?

    What is it? Disaster recovery (DR) is an organization's ability to respond to and recover from an event that negatively affects business operations.

    What's the goal? The goal of DR methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs.

    How do we do that? We need to create a plan that addresses everything required to restore the systems, restore operations, and minimise data loss and damage to the company's image.
  3. EKS Disaster Recovery: The idea and requirements

    AWS EKS: Our goal is to find a way to keep services online, or at least reduce the downtime of the EKS production cluster, in case of an AWS Region outage.

    Recovering: To do that, we need to find a way to restore the cluster in another region as fast as possible, using automation through infrastructure as code or another solution.
  4. Methods that we can use: essentially we have 3 methods

    Active-Active: In this scenario we have 2 clusters running. We deploy to both clusters, and we operate and monitor both. If the primary fails, the other cluster keeps serving traffic. Characteristics: no downtime; very expensive; easy to manage; data replication can be challenging.

    Warm standby: We have a smaller second cluster running, with the same workloads and configurations at a tiny scale. In case of a disaster we can manually switch the LB/DNS to the second cluster and scale up. Characteristics: some downtime (minutes to hours); expensive; somewhat easy to manage; data replication can be challenging.

    Restore from backup: Here we need to recreate the cluster in another region (with automation) and restore everything from the backup system. Characteristics: downtime (could be hours to days); not that expensive; hard to do and manage due to uncertainties; no need to replicate data; restore from backup can take time.
  5. Methods that we can use

    Options: Active-Active, Warm standby, Restore from backup.

    Solutions: For each method, we have a way to restore the business, and we can use different tools to do it. Let's see some options for each one of them.
  6. [Active|Active]

    Diagram: primary and secondary EKS clusters, with deployment to both clusters and volume replication between them.

    Secondary cluster: same size; same configurations; same workloads; replica of stateful volumes (complex).

    Pros/Cons: no downtime; very expensive; easy to manage; data replication can be challenging; an external LB is needed; an external DNS is needed.
  7. Some options: [Active|Active]

    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management/replication: AWS
    - DNS provider: cloudflare
  8. How to reduce the cost of Active/Active?

    Smaller instances: On the second active cluster we can use smaller instances with a strong autoscaling configuration. The load balancer can send 25% of the traffic to cluster B and, in case of an outage, that cluster can scale out to fit user needs.

    Analyser tools: With SaaS analysers we can find and adjust pods and nodes, reducing cloud costs.
    - cast.ai: https://cast.ai/cost-optimizer/
    - spot.io: https://cast.ai/cost-optimizer/
    - harness.io: https://harness.io/products/cloud-cost/

    Instances: Spot.io can facilitate the use of spot instances to reduce instance costs.
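
The 25% split described on this slide can be sketched as a weighted routing decision. This is only an illustration, not how any particular load balancer implements it: the function name `pick_cluster`, the cluster labels "A" and "B", and the hash-based bucketing are all assumptions made for the example.

```python
import hashlib


def pick_cluster(request_id: str, weight_b: float = 0.25, a_healthy: bool = True) -> str:
    """Return "A" or "B", sending roughly weight_b of the traffic to cluster B."""
    if not a_healthy:
        # Outage on cluster A: all traffic goes to B, which then has to scale out.
        return "B"
    # Hash the request id into a stable number in [0, 1) so the same
    # request id always lands on the same cluster.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < weight_b else "A"


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    share_b = sum(pick_cluster(r) == "B" for r in ids) / len(ids)
    print(f"share of traffic on cluster B: {share_b:.2%}")
```

Keeping a real share of traffic on the smaller cluster means it is continuously exercised, so the scale-out during an outage starts from a cluster that is known to work.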
  9. [Active|Active]: a multi-region EKS cluster can be complex and expensive

    Pieces involved: AWS Backup for EFS, S3 replication, EBS snapshots, RDS snapshots, cross-region backup, Multi-AZ, IAM, EC2, EKS, VPC.
  10. RDS: AWS Cross region replication

    Cross-region features: We can create RDS Cross Region Read Replicas for unencrypted MySQL and PostgreSQL database instances.

    Disaster Recovery: You can create cross-region read replicas of your primary database instance to have a disaster recovery solution. If your primary region faces a disruption, you can promote the replica to a master and keep your business operational.

    Scaling: You can use cross-region read replicas to support read queries from your workloads across various geographic locations. This will reduce latency by serving your customers from a database that is close to them.

    Cross-region Migration: If you would like to migrate your database instance quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region, and once it is ready, promote it to master and point your application to it.
  11. S3: AWS Cross region replication

    Cross-region features: We can create an S3 Cross Region Replica; every object uploaded to the origin S3 bucket will be replicated to the destination bucket.

    Disaster Recovery: You can create cross-region replicas of your primary bucket to have a disaster recovery solution. If your primary region faces a disruption, you can point your app to the replica bucket and keep your business operational.

    Lower latency: You can use cross-region replication to provide lower-latency data access in different geographic regions.

    Cross-region Migration: If you would like to migrate your bucket quickly to another AWS region, you may do so by using cross-region replication. Simply create a replica in your target region, and once it is ready, point your application to it.

    No extra costs: You pay Amazon S3's usual charges for storage, requests, and inter-region data transfer for the replicated copy of data.

    Cross-region replication is available in the US Standard, US West (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (Sao Paulo) regions.
  12. EBS Snapshot: AWS Cross Region Copy

    Cross-region features: We can create an EBS snapshot copy across AWS regions.

    Disaster Recovery: You can create cross-region copies of your EBS snapshots to have a disaster recovery solution. If your primary region faces a disruption, you can restore the snapshot copy in another region, mount the volume in your cluster and point your app to the new PV.

    We need to automate the copy, restore and remount steps somehow with IaC. AWS only offers the copy, not a replica, so out of the box it's pretty much a manual procedure.
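
One way to automate the copy step is a small script around the EC2 `copy_snapshot` API call, which has to be issued against the destination region. The sketch below is a minimal illustration under assumptions: the function names, the snapshot ids and the region names are made up, and the client is passed in so the logic can be exercised without AWS credentials. In real use `dest_ec2` would be something like `boto3.client("ec2", region_name="us-west-2")`.

```python
from typing import Any, Iterable


def copy_snapshot_to_dr(dest_ec2: Any, snapshot_id: str, source_region: str) -> str:
    """Copy one EBS snapshot into the region that dest_ec2 is bound to.

    dest_ec2 is expected to behave like a boto3 EC2 client created for the
    DR (destination) region.
    """
    response = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    # The response carries the id of the new snapshot in the destination region.
    return response["SnapshotId"]


def copy_all(dest_ec2: Any, snapshot_ids: Iterable[str], source_region: str) -> dict[str, str]:
    """Copy several snapshots and return a mapping of source id to copy id."""
    return {sid: copy_snapshot_to_dr(dest_ec2, sid, source_region) for sid in snapshot_ids}
```

The returned mapping is what a later restore step would need in order to create volumes from the copies and bind them to new PVs; that part is not covered here.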
  13. Cross Region Backup: AWS Backup

    We can create backups of several services in another region.

    AWS Backup offers scheduling, policy-based backup, tag-based backup, retention management, backup audit, lifecycle policies, data encryption, cross-account backup, access policies and even item-level recovery for EFS. It's a fantastic solution, but we need to adjust and configure everything in the backup region to get things running like the original region. We'll need heavy automation to have everything operational again.

    Services covered: EC2 instances, EBS, RDS, DynamoDB, EFS, Storage Gateway, Neptune, DocumentDB, S3 (Preview).
  14. Cross Region synchronisation: AWS DataSync

    AWS DataSync is a cloud migration tool, often used to migrate data from on-premises to AWS, but you can also use it to sync data between AWS storage services. DataSync offers another way to sync S3 and EFS between regions; it's simple and low cost. With AWS DataSync you can easily replicate, archive, or share application data.

    Services covered: EFS, S3, FSx for Windows File Server, FSx for Lustre.

    Price for 1 TB per month: 1,000 GB x 0.0125 USD = 12.50 USD. Total: 12.50 USD.
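
The price estimate on this slide is a simple per-GB multiplication, which can be reproduced for other volumes. The rate of 0.0125 USD per GB is the figure quoted on the slide and may not match current pricing; the function name is made up for this sketch.

```python
from decimal import Decimal

# Per-GB rate as quoted on the slide; treat it as an assumption, not current pricing.
SLIDE_RATE_USD_PER_GB = Decimal("0.0125")


def datasync_transfer_cost(gb: int, usd_per_gb: Decimal = SLIDE_RATE_USD_PER_GB) -> Decimal:
    """Return the estimated transfer cost in USD, rounded to cents."""
    return (Decimal(gb) * usd_per_gb).quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(datasync_transfer_cost(1000))  # the 1 TB case from the slide
```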
  15. Cross Region Replica and Copy: AWS Calculator

    We are still trying to find a way to estimate this cost; for now the AWS calculator doesn't offer that type of estimation.
  16. Warm Stand-by

    Diagram: primary and secondary EKS clusters, with deployment to both clusters; volumes are restored to the secondary from a K8S backup.

    Secondary cluster: different sizes; same configurations; same workloads (replicas: 0); restore only volumes from backup.

    Pros/Cons: some downtime; expensive; somewhat easy to manage; restore can take time; an external LB is needed; an external DNS is needed; need to scale up after the switch.
  17. Some options: Warm Stand-by

    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management / cross-region replica: AWS
    - Backup/Restore: velero
    - DNS provider: cloudflare
  18. Restore from backup: Option 1

    Diagram: a cluster provisioner creates the new EKS cluster, then the K8S workloads backup and the K8S volumes backup are restored into it.

    New cluster: need to create a new cluster; restore K8S backups:
    - Workloads backups
    - Volumes backups

    Pros/Cons: downtime (could be hours to days); not that expensive; hard to do and manage due to uncertainties; no need to replicate data; no need to reinstall everything; restore from backup can take time; need an external LB; need an external DNS.
  19. Restore from backup: Option 2

    Diagram: a cluster provisioner creates the new EKS cluster, an apps provisioner restores the applications, and storage is restored from AWS Backup.

    New cluster: need to create a new cluster; GitOps will recreate the configuration; need to restore the AWS backup.

    Pros/Cons: downtime (could be hours to days); not that expensive; hard to do and manage due to uncertainties; need to replicate data across regions; restore from AWS Backup can take time; need to reinstall everything; need an external LB; need an external DNS.
  20. Some options: Restore from backup, Option 1

    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux
    - Volume management/backup: longhorn
    - Backup/Restore: velero
    - DNS provider: cloudflare

    Longhorn: If the primary cluster is using longhorn instead of EBS/EFS for volumes, it's easy to restore statefulset volumes on the new cluster.

    Helm: Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.
  21. Some options: Restore from backup, Option 2

    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux
    - Volume management/sync/backup: AWS Backup, AWS DataSync
    - DNS provider: cloudflare

    Helm: Helm can facilitate the deployment of the workloads and help simplify package management of the cluster, especially between different environments.
  22. How to reduce the downtime of Backup/Restore?

    Automation: Using automation is the only way to reduce the downtime; it's an essential and vital part of the DR plan.

    Restore Plan: If the restore plan is important, the order in which things are restored is the most important part. We need to find the fastest and safest way to bring everything up in the right order.

    Restore Test: It's essential to test the disaster recovery plan to ensure that it is still viable and that it covers everything we need. During the test, we can improve the procedure documentation and make the necessary adjustments, always following a continuous improvement strategy.

    SaaS Backup: Using an external SaaS backup can help with the backup and sometimes speed up the restore.

    SaaS DR: Some companies offer the entire DR as a service.
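
The point about restore order can be made concrete by writing the plan down as dependencies and letting a topological sort produce the sequence. The step names below are hypothetical examples, not the steps of any real plan; the sketch uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical restore steps: each key maps to the steps it depends on.
RESTORE_DEPENDENCIES = {
    "network": set(),
    "cluster": {"network"},
    "volumes": {"cluster"},
    "databases": {"volumes"},
    "applications": {"databases", "cluster"},
    "dns_switch": {"applications"},
}


def restore_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order where every dependency comes first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the list of steps that form the cycle.
        raise ValueError(f"restore plan has a circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for position, step in enumerate(restore_order(RESTORE_DEPENDENCIES), start=1):
        print(position, step)
```

Keeping the dependencies in data rather than in a runbook paragraph also makes a circular plan fail loudly during a restore test instead of during a real outage.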
  23. Key principles

    Things that we need to do to have a good DR strategy.
  24. Key principles for a good DR strategy

    External DNS: An external DNS can help to switch to another cloud provider or region fast, but it can have issues with DNS zone caching and TTL.

    External Load Balancer: An external load balancer can help to switch to another cloud provider or region fast and, unlike DNS, there is no gap on the client side.

    External database: An external database like Atlas can be very useful because you can move your workloads without moving your data or DBMS.

    External Object Storage: External object storage like MinIO can be very useful because you can move your workloads without moving your static data.

    Replication: In the case of statefulsets, we need to replicate the volumes somehow. This can be challenging and complex, but it's the only way to do DR.

    External Observability: It's essential to maintain visibility of your services and metrics. If observability runs in the same region, you may lose visibility and end up operating in the dark.

    External Backup solution: An external backup service (SaaS) is essential to have speed and security and to be able to restore your statefulsets to any cloud provider you want, especially if you don't have a replication solution up and running.

    External Log solution: For the same reason as observability, logs should live outside the region where the workloads run.
  25. Key principles for a good DR strategy

    Statefulsets to object storage: It's important to map the statefulset apps and refactor them to use object storage instead. With that, we can move to another region with more flexibility and we can also use HPA for the app.

    External or Multi-Region Database: It's important to use external databases like Atlas or cloud databases with multi-region HA / replication.
  26. Other tools: Closed solutions (SaaS)

    - Arpio.io: It's an AWS replication service; it can replicate configuration and data from one region to another.
    - Kasten.io: It's a Kubernetes backup system....
    - Cohesity.com: It's a Kubernetes backup system....
    - Portworx.com: It's a Kubernetes backup system....
    - Trilio.io: It's a Kubernetes backup system....
  27. Other tools: Read more...

    - Arpio.io: https://arpio.io/how-it-works/
    - Kasten.io: https://www.kasten.io/kubernetes/use-cases/disaster-recovery/
    - Cohesity.com: https://www.cohesity.com/products/sitecontinuity-for-disaster-recovery-as-a-service/
    - Portworx.com: https://portworx.com/products/px-backup/
    - Trilio.io: https://trilio.io/triliovault-for-kubernetes/
  28. Final notes: EKS Disaster Recovery Investigation

    The ideal option is to have Active/Active clusters, but the volume replication can be challenging, and the cost will at least double.

    The cheapest option is to use backup and restore in another region, but the downtime can be long. It will affect the users and the business and be bad for the company's image.

    Something in between is to have the second cluster in passive mode (warm standby) with 1/4 of the size of the main cluster. This cluster will have the same workloads and configurations, with the pods stopped. You only need to restore the volumes and start your apps. The downtime will be shorter than with backup and restore, but it's still there. The cost is higher than backup and restore; it's not double, but it's something to consider.
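
The cost comparison above can be put into rough numbers. The sketch below is a deliberately naive model built on assumptions that are not in the slides: compute cost scales linearly with cluster size, and replication, backup storage and data transfer costs are ignored. Only the "double" and "1/4 of the size" figures come from the text.

```python
def relative_compute_cost(strategy: str, standby_fraction: float = 0.25) -> float:
    """Return compute cost relative to a single production cluster (1.0).

    Naive model: ignores replication, backup storage and data transfer.
    """
    multipliers = {
        "active_active": 2.0,                    # two full-size clusters
        "warm_standby": 1.0 + standby_fraction,  # main cluster plus a 1/4-size standby
        "backup_restore": 1.0,                   # second cluster exists only during a restore
    }
    return multipliers[strategy]


if __name__ == "__main__":
    for name in ("active_active", "warm_standby", "backup_restore"):
        print(f"{name}: {relative_compute_cost(name):.2f}x")
```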
  29. Final notes: AWS Backup and Region Replication

    It's complex, even with automation. It's expensive, it's not that easy to maintain, manage and operate, and the cost could be unviable compared with the benefits.
  30. Final notes: Use of a Cloud Native Storage

    Moving from EBS/EFS to a cloud native storage can offer more flexibility to back up and restore volumes, but it's one more tool to manage, and the performance, with one more layer, can be lower than with EBS/EFS, even with tuning and fast SSD disks.
  31. Final notes: Active Monitoring and Metrics

    To prevent downtime, you need a very sophisticated monitoring and metrics system. With this, you can notice bad behavior in your apps, like spikes and intermittent failures, before they become a problem. With that, you can move fast to another cloud provider or at least to another region in your cloud provider.

    Perhaps one of the most important things is to have your monitoring and metrics system in a different place from where your apps are running, like another cloud provider or a SaaS service. This strategy is essential to avoid losing visibility of your systems, services, apps, and cloud infrastructure.
  32. Final notes: Backup

    We have to keep the backups in a different cloud provider or at least in another region of the same cloud provider. It's the same strategy as for the monitoring and metrics system: to be available, the backups need to be in a different region from the one having problems.
  33. Final notes: Our suggestion

    In our opinion, you should use the warm standby option if the idea is to restore systems and services with minimal downtime.

    Why? It has the best cost/benefit ratio of the three scenarios.

    Benefits? The downtime will be shorter than with Backup/Restore, and the service will be stable in another region.

    Cost? The cost will be lower than Active/Active, with a good response time to restore systems and services.

    Complexity? It's not that complex. With terraform and GitOps we can manage it.

    Learning curve? It's small to medium for the client's DevOps team.

    Estimated time to achieve that? I would say 160h for the initial setup, tests, and documentation for the first phase.
  34. Final notes: Too much information?

    It would help if you decided how much availability you want. With that settled, we can suggest a proper solution. This investigation was merely a starting point for creating a DR strategy and a robust DR plan to:
    - Move to another cloud region
    - Move to another cloud provider

    Just keep in mind that to migrate/restore faster and with minimal downtime, the use of stateless applications is essential.
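
Deciding "how much availability you want" is easier with the target translated into a downtime budget. The sketch below converts an availability percentage into allowed downtime per year, assuming a 365-day year; the function name is made up for this example. For instance, 99.9% leaves about 8.76 hours per year, which is the budget a "hours to days" restore from backup would have to fit into.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_hours(target):.2f} h/year")
```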
  35. Thank You