OCB: Cloud Native Disaster Recovery for Stateful Workloads

In this OpenShift Commons Briefing, Raffaele Spazzoli (Red Hat) will introduce a new way of thinking about disaster recovery in a cloud native setting. He will introduce the concept of Cloud Native Disaster Recovery, the characteristics it should have, and the problems that need to be addressed when designing a disaster recovery strategy for stateful applications in a cloud native setting.

YouTube: https://youtu.be/nyspc6HcDQA


Red Hat Livestreaming

August 24, 2021


  1. Cloud Native Disaster Recovery

    Raffaele Spazzoli, Architect at Red Hat, Lead at TAG Storage

    Alex Chircop, CEO at StorageOS, Co-chair TAG Storage
  2. Cloud Native Disaster Recovery

    | Concern | Traditional DR | Cloud Native DR |
    |---|---|---|
    | Type of deployment | Active/passive, rarely active/active | Active/active |
    | Disaster detection and recovery trigger | Human | Autonomous |
    | Disaster recovery procedure execution | Mix of manual and automated tasks | Automated |
    | Recovery Time Objective (RTO) | From close to zero to hours | Close to zero |
    | Recovery Point Objective (RPO) | From zero to hours | Exactly zero for strongly consistent deployments. Theoretically unbounded, practically close to zero for eventually consistent deployments. |
    | DR process owner | Often the Storage Team | Application Team |
    | Capabilities needed for DR | From storage (backup/restore, volume replication) | From networking (east-west communication, global load balancer) |

    The attributes and measurements in this table are generally accepted for Disaster Recovery architectures.
  3. CNDR - Reference Architecture

    Traditional DR strategies are still possible in the cloud. Here we are focusing on a new approach.
  4. Availability and Consistency

    Some definitions

    High Availability: High Availability (HA) is a property of a system that allows it to continue performing normally in the presence of failures. What happens when a component in a Failure Domain is lost?

    Consistency: Consistency is the property of a distributed stateful workload by which all of the instances of the workload "observe" the same state.

    Disaster Recovery: Disaster recovery (DR) refers to the strategy for recovering from the complete loss of a datacenter. What happens when an entire Failure Domain is lost?

    Failure Domain: Failure domains are areas which may fail due to a single event. Examples: nodes, racks, Kubernetes clusters, network zones and datacenters.
  5. CAP Theorem

    | Product | CAP choice (either Availability or Consistency) |
    |---|---|
    | DynamoDB | Availability |
    | Cassandra | Availability |
    | CockroachDB | Consistency |
    | MongoDB | Consistency |

    PACELC corollary: in the absence of a network partition, one can optimize for either latency or consistency, but not both.
  6. Consensus Protocols

    Consensus Protocols allow for the coordination of distributed processes by agreeing on actions to be taken.

    Shared State Consensus Protocols: Protocols in which all participants perform the same action. They are implemented around the concepts of leader election and strict majority: Paxos, Raft.

    Unshared State Consensus Protocols: Protocols in which all participants perform different actions. They require the acknowledgment of all participants and are vulnerable to network partitioning: 2PC, 3PC.

    Reliable Replicated Data Store: Building on consensus protocols and the concept of sharing a log of operations, it is possible to build a Reliable Replicated Data Store. Apache BookKeeper is an example of a Reliable Replicated Data Store (for the log abstraction use case: append only).
  7. Anatomy of a Stateful Application

    Replicas: Replicas are a way to increase the availability of a stateful workload.

    Partitions: Partitions are a way to increase the general throughput of the workload. This is achieved by breaking the state space into partitions or shards.

    Putting it all together: Stateful Workload Logical Tiers
  8. Examples of Consensus Protocol choices

    | Product | Replica consensus protocol | Shard consensus protocol |
    |---|---|---|
    | Etcd | Raft | N/A (no support for shards) |
    | Consul | Raft | N/A (no support for shards) |
    | Zookeeper | Atomic Broadcast (a derivative of Paxos) | N/A (no support for shards) |
    | ElasticSearch | Paxos | N/A (no support for transactions) |
    | Cassandra | Paxos | Supported, but details are not available. |
    | MongoDB | Paxos | Homegrown protocol. |
    | CockroachDB | Raft | 2PC |
    | YugabyteDB | Raft | 2PC |
    | TiKV | Raft | Percolator |
    | Spanner | Raft | 2PC + high-precision time service |
    | Kafka | A custom derivative of PacificA | Custom implementation of 2PC |
  9. Strongly-Consistent vs Eventually-Consistent CNDR

    | Concern | Strongly-Consistent | Eventually-Consistent |
    |---|---|---|
    | RPO | Zero | Theoretically unbounded, practically close to zero. Temporary inconsistency can happen. Note: eventual consistency does not mean eventual correctness. |
    | RTO | Few seconds. | Few seconds. |
    | Latency | Strong sensitivity to latency between failure domains; single transaction latency will be >= 2 x worst latency between failure domains. | No sensitivity to latency between failure domains. |
    | Throughput | Theoretically scales linearly with the number of instances; practically it depends on the workload type and the max throughput available between failure domains. | Theoretically scales linearly with the number of instances; practically it depends on the workload type. |
    | Minimum required failure domains | Three | Two |
  10. CNDR -- Strong Consistency - Kubernetes Reference Architecture

  11. CNDR -- Eventual Consistency - Kubernetes Reference Architecture

  12. References

    TAG Storage Cloud Native Disaster Recovery

    Demos and reference implementations:

    Geographically Distributed Stateful Workloads Part One: Cluster Preparation

    Geographically Distributed Stateful Workloads Part Two: CockroachDB

    Geographically Distributed Stateful Workloads - Part 3: Keycloak
  13. Thank you

  14. Short Demo

  15. Demo Scenario

  16. DR Simulation
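
The arithmetic behind slides 6 and 9 can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the talk: the function names are hypothetical, one voting instance per failure domain is assumed, and the latency bound simply restates the slide's ">= 2 x worst latency" rule rather than deriving it.

```python
def majority_quorum(voters: int) -> int:
    """Smallest strict majority among `voters` (the Paxos/Raft style quorum)."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a strict majority is still reachable."""
    return voters - majority_quorum(voters)


def two_phase_commit_decision(votes: list[bool]) -> bool:
    """2PC commits only if every participant acknowledges.

    A single unreachable participant (modelled here as a False vote)
    blocks the commit, which is why 2PC is vulnerable to partitions.
    """
    return all(votes)


def transaction_latency_floor_ms(latencies_ms: list[float]) -> float:
    """Lower bound stated on slide 9: 2 x the worst latency between
    failure domains (assumption: the values are one-way latencies)."""
    return 2 * max(latencies_ms)


if __name__ == "__main__":
    for domains in (2, 3, 5):
        print(
            f"{domains} failure domains: quorum={majority_quorum(domains)}, "
            f"survives loss of {tolerated_failures(domains)}"
        )
    print("2PC with one silent participant commits:",
          two_phase_commit_decision([True, True, False]))
    print("latency floor (ms):", transaction_latency_floor_ms([4.0, 11.0, 35.0]))
```

With two failure domains the majority quorum is two, so losing either domain stalls a strongly consistent deployment; with three, one domain can be lost. That is consistent with the "minimum required failure domains" row on slide 9.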