
EKS DR Investigation

Guto Carvalho
February 09, 2022



Transcript

  1. EKS DR
    Investigation

  2. Let's dive into the DR options for EKS
    EKS DR Research

  3. EKS DR
    Let's investigate it
    Guto Carvalho
    Cloud Native Engineer

    - Disaster Recovery 101 (DR)
    - DR Active|Active
    - AWS Cross Region Replication
    - DR Warm stand-by
    - DR Restore from backup
    - Key Principles
    - Other tools
    - Final notes

  4. Introduction
    Disaster Recovery


  5. Disaster Recovery
    What does it mean?

    What is it?
    Disaster recovery (DR) is an organization's ability to respond to and
    recover from an event that negatively affects business operations.

    What's the goal?
    The goal of DR methods is to enable the organization to regain use of
    critical systems and IT infrastructure as soon as possible after a
    disaster occurs.

    How do we do that?
    We need to create a plan that addresses everything we need to restore
    the systems, restore operations, and minimise data loss and damage to
    the company's image.

  6. EKS Disaster Recovery
    The idea and requirements

    AWS EKS
    Our goal is to find a way to keep services online or, at least, reduce
    the downtime of the EKS production cluster in case of an AWS Region
    outage.

    Recovering
    To do that, we need to find a way to restore the cluster in another
    region as fast as possible, using automation through infrastructure as
    code or another solution.

  7. Methods that we can use
    Essentially we have 3 methods

    Active-Active
    In this scenario we have 2 clusters running. We deploy to both clusters
    and we operate and monitor both. If the primary fails there is no
    problem at all: the other cluster keeps everything running.
    - No downtime
    - Very expensive
    - Easy to manage
    - Data replication can be challenging

    Warm standby
    We have a smaller second cluster running, with the same workloads and
    configurations at a tiny scale. In case of a disaster we can manually
    switch the LB/DNS to the second cluster and scale up.
    - Some downtime (minutes to hours)
    - Expensive
    - Somewhat easy to manage
    - Data replication can be challenging

    Restore from backup
    Here we need to recreate the cluster in another region (with
    automation) and restore everything from the backup system.
    - Downtime (could be hours to days)
    - Not that expensive
    - Hard to do and manage due to uncertainties
    - No need to replicate data
    - Restore from backup can take time

  8. Methods that we can use
    EKS DR Investigation

    Options
    > Active-Active
    > Warm-standby
    > Restore from backup

    Solutions
    For each method, we have a way to restore the business, and we can use
    different tools to do it. Let's see some options for each one of them.

  9. Let's understand how it works
    DR Active Active


  10. [Active|Active]
    EKS DR Investigation

    Diagram: Deploy to both EKS clusters (Primary and Secondary), with
    volume replication between them.
    - Same size
    - Same configurations
    - Same workloads
    - Replica of stateful volumes (complex)

    Pros/Cons
    - No downtime
    - Very expensive
    - Easy to manage
    - Data replication can be challenging
    - External LB is needed
    - External DNS is needed

  11. Some options
    [Active|Active]
    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management/replication: AWS
    - DNS Provider: cloudflare

  12. How to reduce the cost of Active/Active?
    DR [Active|Active]

    Smaller instances
    On the second active cluster we can use smaller instances with a strong
    autoscaling configuration. The load balancer can send 25% of the
    traffic to cluster B and, in case of an outage, the cluster can scale
    out to fit the user needs.

    Analysers
    With SaaS analysers we can find and adjust pods and nodes, reducing
    cloud costs.

    Instances
    Spot.io can facilitate the use of spot instances to reduce instance
    costs.

    Tools
    - cast.ai: https://cast.ai/cost-optimizer/
    - spot.io
    - harness.io: https://harness.io/products/cloud-cost/
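
The 25% traffic split implies a scale-out factor when cluster B has to absorb all traffic. A minimal sketch of that arithmetic, assuming capacity scales linearly with node count (a simplification) and using hypothetical function names:

```python
import math


def failover_scale_factor(secondary_share: float) -> float:
    """How many times cluster B must grow to serve 100% of the traffic."""
    if not 0 < secondary_share <= 1:
        raise ValueError("secondary_share must be in (0, 1]")
    return 1 / secondary_share


def nodes_after_failover(current_nodes: int, secondary_share: float) -> int:
    """Node count cluster B needs after failover, rounded up."""
    return math.ceil(current_nodes * failover_scale_factor(secondary_share))


# Cluster B serves 25% of traffic on 3 nodes (illustrative numbers).
print(failover_scale_factor(0.25))
print(nodes_after_failover(3, 0.25))
```

The autoscaler limits on cluster B would need to allow at least that node count, otherwise the scale-out stalls during the outage.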

  13. Multi Region EKS Cluster can be complex and expensive
    [Active|Active]
    - AWS Backup for EFS
    - S3 Replication
    - EBS Snapshot
    - RDS Snapshot
    - Cross-region-backup
    - Multi-AZ
    - IAM
    - EC2
    - EKS
    - VPC

  14. Let's understand how it works
    AWS Backup & Cross Region Features


  15. RDS
    AWS Cross region replication
    Cross-region Features

    We can create RDS cross-region read replicas for unencrypted MySQL and
    PostgreSQL database instances.

    Disaster Recovery: You can create cross-region read replicas of your
    primary database instance to have a disaster recovery solution. If your
    primary region faces a disruption, you can promote the replica to a
    master and keep your business operational.

    Scaling: You can use cross-region read replicas to support read queries
    from your workloads across various geographic locations. This will
    reduce latency by serving your customers from a database that is close
    to them.

    Cross-region Migration: If you would like to migrate your database
    instance quickly to another AWS region, you may do so by using
    cross-region replication. Simply create a replica in your target
    region, and once it is ready, promote it to master and point your
    application to it.

  16. S3
    AWS Cross region replication
    Cross-region Features

    We can create an S3 cross-region replica: every object uploaded to the
    origin S3 bucket will be replicated to the destination bucket.

    Disaster Recovery: You can create cross-region replicas of your primary
    bucket to have a disaster recovery solution. If your primary region
    faces a disruption, you can point your app to the replica bucket and
    keep your business operational.

    Lower latency: You can use cross-region replication to provide
    lower-latency data access in different geographic regions.

    Cross-region Migration: If you would like to migrate your bucket
    quickly to another AWS region, you may do so by using cross-region
    replication. Simply create a replica in your target region, and once it
    is ready, point your application to it.

    No extra costs: You pay Amazon S3’s usual charges for storage,
    requests, and inter-region data transfer for the replicated copy of
    data.

    Cross-region replication is available in the US Standard, US West
    (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia
    Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and
    South America (Sao Paulo) regions.

  17. EBS Snapshot
    AWS Cross Region Copy
    Cross-region Features

    We can create an EBS snapshot copy across AWS regions.

    Disaster Recovery: You can create cross-region copies of your EBS
    snapshots to have a disaster recovery solution. If your primary region
    faces a disruption, you can restore the snapshot copy in another
    region, mount the volume in your cluster and point your app to the new
    PV.

    We need to automate the copies, restores and remounts somehow with
    IaC. AWS only offers the copy, not a replica, so it's pretty much a
    manual procedure.
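
The snapshot copy step is one piece that can be scripted. A minimal sketch, assuming a boto3 EC2 client created in the destination (DR) region is passed in; `copy_snapshot` is the EC2 client call, while the function name and the snapshot and region values are hypothetical. Encrypted snapshots, which need extra KMS parameters, are not handled here.

```python
def copy_snapshot_to_dr_region(ec2_dr_client, snapshot_id: str, source_region: str) -> str:
    """Copy one EBS snapshot into the region the client is bound to.

    Returns the ID of the new snapshot in the DR region.
    """
    response = ec2_dr_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]


# Usage sketch (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   ec2_dr = boto3.client("ec2", region_name="us-west-2")
#   new_id = copy_snapshot_to_dr_region(ec2_dr, "snap-0123456789abcdef0", "us-east-1")
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.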

  18. Cross Region Backup
    AWS Backup

    We can create backups of several services in another region.

    It offers scheduling, policy-based backup, tag-based backup, retention
    management, backup audit, lifecycle policies, data encryption,
    cross-account backup, access policies and even item-level recovery for
    EFS.

    It's a fantastic solution, but we need to adjust and configure
    everything in the backup region to get things up and running like the
    original region.

    We'll need to use heavy automation to have everything operational
    again.

    Services covered
    - EC2 Instances
    - EBS
    - RDS
    - DynamoDB
    - EFS
    - Storage Gateway
    - Neptune
    - DocumentDB
    - S3 (Preview)

  19. Cross Region synchronisation
    AWS DataSync

    AWS DataSync is a cloud migration tool, often used to migrate data from
    on-premises to AWS, but you can also use it to sync data between AWS
    storage services.

    DataSync offers another way to sync S3 and EFS between regions. It's
    simple and low cost.

    With AWS DataSync you can easily replicate, archive, or share
    application data.

    Services covered
    - EFS
    - S3
    - FSx for Windows File Server
    - FSx for Lustre

    Price
    1 TB per month
    1,000 GB x 0.0125 USD = 12.50 USD
    Total: 12.50 USD
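
The price estimate on this slide can be reproduced for other volumes. A minimal sketch, assuming the 0.0125 USD per GB rate quoted above still applies (current pricing may differ); the function name is hypothetical.

```python
from decimal import Decimal

# Per-GB rate quoted on the slide; treat it as an assumption, not current pricing.
PRICE_PER_GB_USD = Decimal("0.0125")


def datasync_transfer_cost(gigabytes: int, price_per_gb: Decimal = PRICE_PER_GB_USD) -> Decimal:
    """Cost in USD of syncing the given amount of data, rounded to cents."""
    return (Decimal(gigabytes) * price_per_gb).quantize(Decimal("0.01"))


print(datasync_transfer_cost(1000))  # 12.50, matching the slide
```

`Decimal` is used instead of floats so the cents come out exactly as on the slide.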

  20. Cross Region Replica and Copy
    AWS Calculator

    We are still trying to find a way to calculate that; for now the AWS
    calculator doesn't offer that type of estimation.

  21. Warm stand-by
    Let's understand how it works


  22. Warm Stand-by
    EKS DR Investigation

    Diagram (Option 2): Deploy to both EKS clusters (Primary and
    Secondary); volumes are restored from the K8S backup into the
    Secondary.
    - Different sizes
    - Same configurations
    - Same workloads (replica 0)
    - Restore only volumes from backup

    Pros/Cons
    - Some downtime
    - Expensive
    - Somewhat easy to manage
    - Restore can take time
    - External LB is needed
    - External DNS is needed
    - Need to scale up after switch
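
The "scale up after switch" step can be prepared in advance. A minimal sketch that turns the primary cluster's replica counts into `kubectl scale` commands for the standby, where workloads sit at 0 replicas. The deployment names are hypothetical, and it assumes the workloads are Deployments in the current namespace; if a GitOps controller owns the replica counts, the change would go through Git instead.

```python
def scale_up_plan(primary_replicas: dict[str, int]) -> list[str]:
    """Build the kubectl commands that bring standby deployments up to
    the replica counts used on the primary cluster."""
    return [
        f"kubectl scale deployment/{name} --replicas={count}"
        for name, count in sorted(primary_replicas.items())
    ]


# Hypothetical workloads and replica counts taken from the primary cluster.
for command in scale_up_plan({"web": 3, "api": 2}):
    print(command)
```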

  23. Some options
    Warm Stand-by
    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management / cross-region-replica: AWS
    - Backup/Restore: velero
    - DNS Provider: cloudflare

  24. Restore from backup
    Let's understand how it works


  25. Restore from backup
    EKS DR Investigation

    Option 1
    Diagram: a cluster provisioner creates the new EKS cluster, then the
    K8S workloads backup and the K8S volumes backup are restored into it.
    - Need to create a new cluster
    - Restore K8S backups
      - Workloads backups
      - Volumes backups

    Pros/Cons
    - Downtime (could be hours to days)
    - Not that expensive
    - Hard to do and manage due to uncertainties
    - No need to replicate data
    - No need to reinstall everything
    - Restore from backup can take time
    - Need an external LB
    - Need an external DNS

  26. Restore from backup
    EKS DR Investigation

    Option 2
    Diagram: a cluster provisioner creates the new EKS cluster, an apps
    provisioner restores the apps, and storage is restored from the AWS
    backup.
    - Need to create a new cluster
    - GitOps will recreate the configuration
    - Need to restore the AWS backup

    Pros/Cons
    - Downtime (could be hours to days)
    - Not that expensive
    - Hard to do and manage due to uncertainties
    - Need to replicate data across regions
    - Restore from AWS backup can take time
    - Need to reinstall everything
    - Need an external LB
    - Need an external DNS

  27. Some options
    Restore from backup, Option 1
    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management/backup: longhorn
    - Backup/Restore: velero
    - DNS Provider: cloudflare

    Longhorn
    If the primary cluster is using longhorn instead of EBS/EFS for
    volumes, it's easy to restore statefulset volumes on the new cluster.

    Helm
    Helm can facilitate the deployment of the workloads and help simplify
    package management of the cluster, especially between different
    environments.

  28. Some options
    Restore from backup, Option 2
    - Two EKS clusters in different regions: terraform
    - Deployment in both clusters: argocd or flux, helm
    - Volume management/sync/backup: AWS Backup, AWS DataSync
    - DNS Provider: cloudflare

    Helm
    Helm can facilitate the deployment of the workloads and help simplify
    package management of the cluster, especially between different
    environments.

  29. How to reduce the downtime of Backup/Restore?
    Restore from backup

    Automation
    Using automation is the only way to reduce the downtime. It's an
    essential and vital part of the DR plan.

    Restore Plan
    The restore plan is important, and the order in which things are
    restored is its most important part. We need to be sure of the fastest
    and most secure way to have everything up and running in the right
    order.

    Restore Test
    It's essential to test the disaster recovery plan to ensure that it is
    still viable and that it covers everything we need. During the test, we
    can enhance the procedure documentation and make the necessary
    adjustments, always following a continuous improvement strategy.

    SaaS Backup
    Using an external SaaS backup can help with backups and sometimes
    speed up the restore.

    SaaS DR
    Some companies offer the entire DR as a service.
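
The restore order can be derived from declared dependencies instead of being kept by hand. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the step names and their dependencies are hypothetical and would come from the actual DR plan.

```python
from graphlib import TopologicalSorter

# Hypothetical restore steps, each mapped to the steps it depends on.
RESTORE_DEPENDENCIES = {
    "cluster": set(),
    "volumes": {"cluster"},
    "databases": {"volumes"},
    "workloads": {"volumes", "databases"},
    "dns_switch": {"workloads"},
}


def restore_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order where every dependency comes first.

    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(restore_order(RESTORE_DEPENDENCIES))
```

Running this during restore tests also catches circular dependencies in the plan before a real outage does.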

  30. Things that we need to do to have a good DR strategy
    Key principles


  31. Key principles
    for a good DR strategy

    External DNS
    An external DNS can help to switch to another cloud provider or region
    fast, but it can have issues with DNS zone cache and TTL.

    External Load Balancer
    An external load balancer can help to switch to another cloud provider
    or region fast. Unlike DNS, there is no gap on the client side.

    External database
    An external database like Atlas can be very useful because you can move
    your workloads without moving your data or DBMS.

    External Object Storage
    External object storage like Minio can be very useful because you can
    move your workloads without moving your static data.

    Replication
    In case of statefulsets, we need to replicate the volumes somehow. This
    can be challenging and complex, but it's the only way to do DR.

    External Observability
    It's essential to maintain the visibility of your services and metrics.
    If observability is running in the same region, you may lose your eyes
    and will operate in the dark.

    External Backup solution
    An external backup service (SaaS) is essential to have speed, security
    and to be able to restore your statefulsets to any cloud provider you
    want, especially if you don't have a replication solution up and
    running.

    External Log solution
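
The DNS cache and TTL concern can be put into numbers. A minimal sketch, assuming clients and resolvers honour the record TTL (some do not, so real cutover can take longer); the function names and the sample values are hypothetical.

```python
def worst_case_cutover_seconds(detection_seconds: int, ttl_seconds: int) -> int:
    """Upper bound on client-visible cutover: time to notice the outage
    plus the time for cached DNS records to expire."""
    return detection_seconds + ttl_seconds


def max_ttl_for_target(target_seconds: int, detection_seconds: int) -> int:
    """Largest TTL that still meets the cutover target (0 if unreachable)."""
    return max(0, target_seconds - detection_seconds)


print(worst_case_cutover_seconds(60, 300))  # 360
print(max_ttl_for_target(120, 60))          # 60
```

A lower TTL shortens the gap but increases DNS query volume, which is part of why an external load balancer is listed as a separate principle.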

  32. Key principles
    for a good DR strategy

    Statefulsets to object storage
    It's important to map the statefulset apps and refactor them to use
    "Object Storage" instead. With that, we can move to another region with
    more flexibility and we can also use HPA for the app.

    External or Multi-Region Database
    It's important to use external databases like "Atlas" or cloud
    databases with Multi-Region HA / Replication.

  33. Same goals, closed tools
    Other tools


  34. Other tools
    Closed Solutions

    SaaS

    Arpio.io
    It's an AWS replication service. It can replicate configuration and
    data from one region to another.

    Kasten.io
    It's a kubernetes backup system....

    Cohesity.com
    It's a kubernetes backup system....

    Portworx.com
    It's a kubernetes backup system....

    Trilio.io
    It's a kubernetes backup system....

  35. Other tools
    Read more...

    Arpio.io
    https://arpio.io/how-it-works/

    Kasten.io
    https://www.kasten.io/kubernetes/use-cases/disaster-recovery/

    Cohesity.com
    https://www.cohesity.com/products/sitecontinuity-for-disaster-recovery-as-a-service/

    Portworx.com
    https://portworx.com/products/px-backup/

    Trilio.io
    https://trilio.io/triliovault-for-kubernetes/

  36. EKS DR
    Final notes


  37. Final notes
    EKS Disaster Recovery Investigation

    The ideal option is to have Active/Active clusters, but the volume
    replication can be challenging, and the cost will at least double.

    The cheapest option is to use backup and restore in another region, but
    the downtime can be long. It will affect the users and the business and
    be bad for the company's image.

    Something in between is to have the second cluster in passive mode
    (warm standby) with 1/4 of the size of the main cluster. This cluster
    will have the same workloads and configurations, with pods stopped.
    You only need to restore the volumes and start your apps.

    The downtime will be lower than with backup and restore, but it's still
    there.

    The cost is higher than backup and restore. It's not double, but it's
    something to consider.
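
The cost relationship between the three options can be sketched with the ratios given here: double for Active/Active and a standby at 1/4 of the main cluster for warm standby. This is a rough model only; it ignores backup storage, data transfer and replication costs, and the function name and sample cost are hypothetical.

```python
def relative_monthly_cost(strategy: str, primary_cost: float,
                          standby_fraction: float = 0.25) -> float:
    """Rough compute cost of a DR strategy, given the primary cluster's cost."""
    if strategy == "active-active":
        return primary_cost * 2                       # second full-size cluster
    if strategy == "warm-standby":
        return primary_cost * (1 + standby_fraction)  # small passive cluster
    if strategy == "backup-restore":
        return primary_cost                           # no second cluster running
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative primary cluster cost of 10,000 USD per month.
for name in ("active-active", "warm-standby", "backup-restore"):
    print(name, relative_monthly_cost(name, 10_000))
```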

  38. Final notes
    AWS Backup and Region Replication

    It's complex, even with automation. It's expensive, it's not that easy
    to maintain, manage and operate, and the cost could be unviable when
    compared with the benefits.

  39. Final notes
    Use of a Cloud Native Storage

    Moving from EBS/EFS to a cloud native storage can offer more
    flexibility to back up and restore volumes, but it's one more tool to
    manage, and the performance, with one more layer, can be lower than
    EBS/EFS, even with tuning and fast SSD disks.

  40. Final notes
    Active Monitoring and Metrics

    To prevent downtime, you need a very sophisticated monitoring and
    metrics system. With this, you can notice bad behavior in your apps,
    like spikes and intermittent failures, before it becomes a problem.
    With that, you can move fast to another cloud provider or at least to
    another region in your cloud provider.

    Perhaps one of the most important things is to have your monitoring
    and metrics system in a different place from where your apps are
    running, like another cloud provider or a SaaS service.

    This strategy is essential to avoid losing the visibility of your
    systems, services, apps, and cloud infrastructure.

  41. Final notes
    Backup

    We have to keep the backups in a different cloud provider or at least
    in another region of the same cloud provider. It's the same strategy as
    for the monitoring and metrics system: to stay available, the backups
    must live outside the region that has the problem.

  42. Final notes
    Our suggestion

    In our opinion, you should use the warm-standby option if the idea is
    to restore systems and services with minimal downtime.

    Why?
    It has the best cost/benefit ratio of the three scenarios.

    Benefits?
    The downtime will be lower than with Backup/Restore, and the service
    will be stable in another region.

    Cost?
    The cost will be lower than Active/Active, with a good response time to
    restore systems and services.

    Complexity?
    It's not that complex. With terraform and GitOps we can manage it.

    Learning Curve?
    It's small to medium for the client DevOps team.

    Estimated time to achieve that?
    I would say 160h for the initial setup, tests, and documentation for
    the first phase.

  43. Final notes
    Too much information?

    It would help if you decided how much availability you want.
    With that settled, we can suggest a proper solution.

    This investigation was merely a starting point to create a DR strategy
    and a robust DR plan to:
    - Move to another cloud region
    - Move to another cloud provider

    Just keep in mind that to migrate/restore faster and with minimal
    downtime, the use of stateless applications is essential.
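
Deciding "how much availability you want" is easier with the downtime budget each target implies. A minimal sketch of that conversion, assuming a 365-day year; the function name and the sample targets are illustrative.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours_per_year(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return round(HOURS_PER_YEAR * (1 - availability_percent / 100), 2)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours_per_year(target)} h/year")
```

A budget measured in days leaves room for backup and restore, while a budget of under an hour points toward warm standby or Active/Active.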

  44. [email protected]
    Photos: Unsplash, Kubecon

  45. Thank You