OCB: Cloud Native Disaster Recovery for Stateful Workloads

In this OpenShift Commons Briefing, Raffaele Spazzoli (Red Hat) will introduce a new way of thinking about disaster recovery in a cloud native setting: the concept of Cloud Native Disaster Recovery, the characteristics it should have, and the problems that need to be addressed when designing a disaster recovery strategy for stateful applications.

YouTube: https://youtu.be/nyspc6HcDQA

Red Hat Livestreaming

August 24, 2021

Transcript

  1. Cloud Native Disaster
    Recovery
    Raffaele Spazzoli
    Architect at Red Hat
    Lead at TAG Storage
    Alex Chircop
    CEO at Storage OS
    Co-chair TAG Storage
    1

  2. Cloud Native Disaster Recovery
    2
    Concern | Traditional DR | Cloud Native DR
    Type of deployment | Active/passive, rarely active/active | Active/active
    Disaster detection and recovery trigger | Human | Autonomous
    Disaster recovery procedure execution | Mix of manual and automated tasks | Automated
    Recovery Time Objective (RTO) | From close to zero to hours | Close to zero
    Recovery Point Objective (RPO) | From zero to hours | Exactly zero for strongly consistent deployments; theoretically unbounded, practically close to zero for eventually consistent deployments
    DR process owner | Often the storage team | Application team
    Capabilities needed for DR | From storage (backup/restore, volume replication) | From networking (east-west communication, global load balancer)
    The information in this table reflects generally accepted attributes and measurements for disaster recovery architectures.
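
    As a rough illustration of the "autonomous trigger" and "global load balancer" attributes above, the sketch below (hypothetical cluster names and health endpoints, not from the deck) probes each failure domain and keeps only the healthy ones in rotation; a real deployment would delegate this decision to a global load balancer or DNS-based failover rather than a human operator.

        # Minimal sketch (hypothetical endpoints): an autonomous disaster-detection
        # loop that drops unhealthy failure domains from the set of active backends.
        import urllib.request

        CLUSTERS = {
            "eu-west": "https://eu-west.example.com/healthz",   # hypothetical URLs
            "us-east": "https://us-east.example.com/healthz",
        }

        def healthy(url: str, timeout_s: float = 2.0) -> bool:
            try:
                with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                    return resp.status == 200
            except OSError:
                return False

        def active_backends() -> list[str]:
            # A real setup would update DNS records or a global load balancer here.
            return [name for name, url in CLUSTERS.items() if healthy(url)]

        print(active_backends())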

  3. CNDR - Reference Architecture
    3 Traditional DR strategies are still possible in the cloud. Here we are focusing on a new approach.

  4. Availability and Consistency
    4
    Some definitions
    Failure Domain
    Failure domains are areas which may fail due to a single event. Examples: nodes, racks, Kubernetes clusters, network zones and datacenters.
    High Availability
    High Availability (HA) is a property of a system that allows it to continue performing normally in the presence of failures. What happens when a component in a failure domain is lost?
    Consistency
    Consistency is the property of a distributed stateful workload by which all of the instances of the workload "observe" the same state.
    Disaster Recovery
    Disaster recovery (DR) refers to the strategy for recovering from the complete loss of a datacenter. What happens when an entire failure domain is lost?

  5. CAP Theorem
    5
    Product | CAP choice under partition (either availability or consistency)
    DynamoDB | Availability
    Cassandra | Availability
    CockroachDB | Consistency
    MongoDB | Consistency
    PACELC corollary: in the absence of a network partition, one can only optimize for either latency or consistency.
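
    A minimal sketch of the PACELC trade-off follows (illustrative numbers, not from the deck): even without a partition, waiting for a remote replica's acknowledgment buys consistency at the cost of latency, while acknowledging locally minimizes latency but leaves a window of inconsistency.

        # Illustrative only: the same write, acknowledged before or after
        # cross-failure-domain replication.
        import time

        def replicate(data: str, remote_ack_delay_s: float = 0.08) -> None:
            time.sleep(remote_ack_delay_s)   # stand-in for a cross-domain round trip

        def write(data: str, wait_for_remote: bool) -> float:
            start = time.monotonic()
            if wait_for_remote:
                replicate(data)   # consistency-oriented: pay the cross-domain latency now
            # else: replicate asynchronously later, accepting a window of inconsistency
            return time.monotonic() - start

        print(f"consistency-first write: {write('x', wait_for_remote=True):.3f} s")
        print(f"latency-first write:     {write('x', wait_for_remote=False):.3f} s")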

  6. Consensus Protocols
    6
    Consensus protocols allow for the coordination of distributed processes by agreeing on the actions to be taken.
    Shared State Consensus Protocols
    Protocols in which all participants perform the same action. They are implemented around the concepts of leader election and strict majority: Paxos, Raft.
    Unshared State Consensus Protocols
    Protocols in which all participants perform different actions. They require the acknowledgment of all participants and are vulnerable to network partitioning: 2PC, 3PC.
    Reliable Replicated Data Store
    Building on consensus protocols and the concept of sharing a log of operations, it is possible to build a Reliable Replicated Data Store. Apache BookKeeper is an example of a Reliable Replicated Data Store (for the log abstraction use case: append only).
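
    To make the distinction concrete, here is a minimal sketch (not from the deck) of the two acknowledgment rules: a strict-majority protocol can keep making progress while a minority of replicas is unreachable, whereas a 2PC-style protocol stalls until every participant answers.

        def majority_commit(acks: int, replicas: int) -> bool:
            # Shared-state style (Paxos/Raft): commit once a strict majority has acknowledged.
            return acks > replicas // 2

        def all_ack_commit(acks: int, participants: int) -> bool:
            # Unshared-state style (2PC/3PC): commit only when every participant has acknowledged.
            return acks == participants

        # Five replicas, a network partition hides two of them:
        print(majority_commit(acks=3, replicas=5))     # True  -> progress despite the partition
        print(all_ack_commit(acks=3, participants=5))  # False -> blocked until the partition heals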

  7. Anatomy of a Stateful Application
    7
    Stateful Workload Logical Tiers
    Partitions
    Partitions are a way to increase the overall throughput of the workload. This is achieved by breaking the state space into partitions, or shards.
    Replicas
    Replicas are a way to increase the availability of a stateful workload.
    Putting it all together
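
    The sketch below (hypothetical node names and placement policy, not from the deck) shows the two tiers together: a key is hashed to a shard for throughput, and each shard is placed on several nodes, ideally in different failure domains, for availability.

        # Illustrative only: hash-based partitioning plus replica placement.
        import hashlib

        SHARDS = 4    # partitions of the state space (throughput)
        REPLICAS = 3  # copies of each partition (availability)

        def shard_for(key: str) -> int:
            digest = hashlib.sha256(key.encode()).hexdigest()
            return int(digest, 16) % SHARDS

        def replicas_for(shard: int, nodes: list[str]) -> list[str]:
            # Hypothetical placement policy: put the shard's replicas on consecutive
            # nodes, which in practice should map to distinct failure domains.
            return [nodes[(shard + i) % len(nodes)] for i in range(REPLICAS)]

        nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
        shard = shard_for("customer-42")
        print(shard, replicas_for(shard, nodes))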

  8. Examples of Consensus Protocol choices
    8
    Product | Replica consensus protocol | Shard consensus protocol
    etcd | Raft | N/A (no support for shards)
    Consul | Raft | N/A (no support for shards)
    ZooKeeper | Atomic Broadcast (a derivative of Paxos) | N/A (no support for shards)
    Elasticsearch | Paxos | N/A (no support for transactions)
    Cassandra | Paxos | Supported, but details are not available
    MongoDB | Paxos | Homegrown protocol
    CockroachDB | Raft | 2PC
    YugabyteDB | Raft | 2PC
    TiKV | Raft | Percolator
    Spanner | Raft | 2PC + high-precision time service
    Kafka | A custom derivative of PacificA | Custom implementation of 2PC

  9. Strongly-Consistent vs Eventually Consistent CNDR
    9
    Concern | Strongly-Consistent | Eventually-Consistent
    RPO | Zero | Theoretically unbounded, practically close to zero. Temporary inconsistency can happen. Note: eventual consistency does not mean eventual correctness.
    RTO | Few seconds | Few seconds
    Latency | Strong sensitivity to latency between failure domains; single-transaction latency will be >= 2 x the worst latency between failure domains | No sensitivity to latency between failure domains
    Throughput | Theoretically scales linearly with the number of instances; in practice it depends on the workload type and the maximum throughput available between failure domains | Theoretically scales linearly with the number of instances; in practice it depends on the workload type
    Minimum required failure domains | Three | Two
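
    As a back-of-the-envelope reading of the latency row (illustrative numbers, not from the deck): a strongly consistent commit needs at least one cross-domain round trip before it can be acknowledged, which is where the deck's rule of thumb of twice the worst one-way inter-domain latency comes from.

        # Illustrative numbers only: lower bound on commit latency for a strongly
        # consistent write that must be acknowledged by a replica in another
        # failure domain (one request leg + one acknowledgment leg).
        one_way_latency_ms = {("eu-west", "us-east"): 40, ("eu-west", "ap-south"): 75}

        worst_one_way_ms = max(one_way_latency_ms.values())
        min_commit_latency_ms = 2 * worst_one_way_ms
        print(f"lower bound on commit latency: {min_commit_latency_ms} ms")  # 150 ms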

  10. CNDR - Strong Consistency - Kubernetes Reference Architecture
    10

  11. CNDR - Eventual Consistency - Kubernetes Reference Architecture
    11

  12. References
    12
    TAG Storage Cloud Native Disaster Recovery
    Demos and reference implementations:
    Geographically Distributed Stateful Workloads Part One: Cluster Preparation
    Geographically Distributed Stateful Workloads Part Two: CockroachDB
    Geographically Distributed Stateful Workloads - Part 3: Keycloak

  13. Thank you
    13

  14. Short Demo
    14

  15. Demo Scenario
    15

  16. DR Simulation
    16
