
Considerations for running Distributed SQL Databases on Kubernetes


Kubernetes has hit a home run for stateless workloads, but can it do the same for stateful services such as distributed databases? Before we can answer that question, we need to understand the challenges of running stateful workloads on, well, anything. In this talk, we will first look at which stateful workloads, specifically databases, are ideal for running inside Kubernetes. Secondly, we will explore the various concerns around running databases in Kubernetes for production environments, such as the production-readiness of Kubernetes for stateful workloads in general, the pros and cons of the various deployment architectures, and the failure characteristics of a distributed database inside containers.

In this session, we will demonstrate what Kubernetes brings to the table for stateful workloads and what database servers must provide to fit the Kubernetes model. This talk will also highlight some of the modern databases that take full advantage of Kubernetes and offer a peek into what’s possible if stateful services can meet Kubernetes halfway. We will go into the details of deployment choices and how the cloud-vendor managed container offerings differ in what they offer, and compare the performance and failure characteristics of a Kubernetes-based deployment with an equivalent VM-based deployment.

- How different kinds of databases work on Kubernetes
- The production-readiness of Kubernetes for stateful workloads in general
- The pros and cons of the various deployment architectures
- The failure characteristics of a distributed database inside containers


August 03, 2021



  1. YugabyteDB – Distributed SQL Database on Kubernetes

    Amey Banarse, VP of Data Engineering, Yugabyte, Inc.
    Aug 3rd, 2021
    © 2020 - All Rights Reserved
  2. Introductions

    Amey Banarse
    VP of Data Engineering, Yugabyte
    Pivotal • FINRA • NYSE
    University of Pennsylvania (UPenn)
    @ameybanarse
    about.me/amey
  3. Kubernetes Is Massively Popular in Fortune 500s

    • Walmart – Edge Computing (KubeCon 2019)
      https://www.youtube.com/watch?v=sfPFrvDvdlk
    • Target – Data @ Edge
      https://tech.target.com/2018/08/08/running-cassandra-in-kubernetes-across-1800-stores.html
    • eBay – Platform Modernization
      https://www.ebayinc.com/stories/news/ebay-builds-own-servers-intends-to-open-source/
  4. The State of Kubernetes 2020

    • Substantial growth in large enterprises
      ◦ Being used in production environments
    • On-premises deployments still most common
    • There are pain points, but most developers and executives feel k8s investment is worth it
  5. Data on K8s Ecosystem Is Evolving Rapidly
  6. Better resource utilization

    • Reduce cost with better packing of DBs
    • Useful when running large number of DBs
      ◦ Multi-tenant applications with a DB per tenant
      ◦ Self-service private DBaaS
    • But watch out for noisy neighbors
      ◦ Perf issues when running critical production workloads

    (Diagram: Node #1, Node #2, Node #3)
  7. Resize pod resources dynamically

    • Dynamically change CPU, memory
    • Embrace Automation - done without incurring downtime
      ◦ Scale DB with workload
      ◦ Automate to scale up automatically

    $ kubectl apply -f cpu-request-limit.yaml
    $ kubectl apply -f memory-request-limit.yaml
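The slide does not show what `cpu-request-limit.yaml` contains. As a minimal sketch, assuming a single-container pod, the requests and limits such a manifest would carry can be generated as JSON, which `kubectl apply -f` also accepts. The pod name, image and quantities below are hypothetical placeholders, not values from the talk.

```python
import json


def pod_with_resources(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Build a Pod manifest (as a dict) with CPU and memory requests/limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # requests: what the scheduler reserves for the container
                        "requests": {"cpu": cpu_request, "memory": mem_request},
                        # limits: the ceiling enforced at runtime
                        "limits": {"cpu": cpu_limit, "memory": mem_limit},
                    },
                }
            ]
        },
    }


# Hypothetical values for illustration only.
manifest = pod_with_resources("demo-db", "example/db:latest", "2", "4", "4Gi", "8Gi")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it sets the resources declaratively; re-running with different quantities and re-applying is the step the slide's automation bullet refers to.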
  8. Portability between clouds and on-premises

    • Infrastructure as code
    • Works in a similar fashion on any cloud
      ◦ Cloud-provider managed k8s (AKS, EKS, GKE)
      ◦ Self-managed k8s (public/private cloud)
    • But not perfectly portable
      ◦ Need to understand some cloud specific constructs (Example: volume types, load balancers)
  9. Out of box infrastructure orchestration

    • Pods that fail are automatically restarted
    • Pods are resized across nodes in cluster
      ◦ Optimal resource utilization
      ◦ Specify policies in code (example: anti-affinity)
    • Loss of some flexibility
      ◦ Cannot make permanent changes on pods
  10. Automating day 2 operations

    • Robust automation with CRDs (Custom Resource Definitions), commonly referred to as a ‘K8s Operator’
    • Easy to build an operator for ops
      ◦ Periodic backups
      ◦ DB software upgrades
    • Automating failover of traditional RDBMS can be dangerous
      ◦ Potential for data loss?
      ◦ Mitigation: use a distributed DB
  11. Greater chance of pod failures

    • Pods fail more often than VMs or bare metal
    • Many reasons for increased failure rate
      ◦ Process failures - config issues or bugs
      ◦ Out of memory and the OOM Killer
      ◦ Transparent rescheduling of pods
    • Will pod failures cause disruption of the service or data loss?
      ◦ Mitigation: use a distributed DB

    (Diagram: Node #1, Node #2, Node #3, with one pod failed. Data loss likely if local storage used by pod.)
  12. Local vs persistent storage

    • Local storage = use local disk on the node
      ◦ Not replicated, but higher performance
      ◦ Data not present in new pod location
    • Persistent storage = use replicated storage
      ◦ Data visible to pod after it moves to new node
      ◦ What to do for on-prem? Use software solution (additional complexity)
    • Mitigation: use a distributed DB

    (Diagram: Node #1, Node #2, Node #3 with Disk 1 and Disk 3. Pod sees a new, empty disk (Disk 3) after move with local storage.)
  13. Need for a load balancer

    • Restricted cluster ingress in k8s
      ◦ If app not on same k8s cluster, needs LB
    • Needs load balancer to expose DB externally
      ◦ Not an issue on public clouds - use cloud-provider network LBs
      ◦ But there may be per-cloud limits on NLBs and public IP address limits
    • Bigger problem on-prem with hardware based load balancers (Example: F5)

    (Diagram: Node #1, Node #2, Node #3 behind a load balancer to access any DB service.)
  14. Networking complexities

    • Two k8s clusters cannot “see” each other
    • Network discovery and reachability issues
      ◦ Pods of one k8s cluster cannot refer and replicate to pods in another k8s cluster by default
    • Mitigation #1: use DNS chaining today (operational complexity, depends on env)
    • Mitigation #2: use service mesh like Istio (but lower performance - HTTP layer vs TCP)

    (Diagram: replication between two clusters, marked “?”.)
    Video: Kubecon EU 2021 - Building the Multi-Cluster Data Layer
  15. Running a Distributed SQL DB in k8s (YugabyteDB)

    • Better resource utilization
    • Resize pod resources dynamically
    • Portability (cloud and on-premises)
    • Out of box infrastructure orchestration
    • Automate day 2 DB operations
    • Greater chance of pod failures
    • Local storage vs persistent storage
    • Need for a load balancer
    • Networking complexities
    • Operational maturity curve
  16. Transactional, distributed SQL database designed for resilience and scale

    100% open source, PostgreSQL compatible, enterprise-grade RDBMS
    … built to run across all your cloud environments
  17. A Brief History of Yugabyte

    Part of Facebook’s cloud native DB evolution
    • Yugabyte team dealt with this growth first hand
    • Massive geo-distributed deployment given global users
    • Worked with world-class infra team to solve these issues

    Builders of multiple popular databases
    +1 Trillion ops/day
    +100 Petabytes data set sizes

    Yugabyte founding team ran Facebook’s public cloud scale DBaaS
  18. Designing the perfect Distributed SQL Database

    Skyrocketing adoption of PostgreSQL for cloud-native applications
    Aurora much more popular than Spanner

    Amazon Aurora
    A highly available MySQL and PostgreSQL-compatible relational database service
    • Not scalable but HA
    • All RDBMS features
    • PostgreSQL & MySQL

    Google Spanner
    The first horizontally scalable, strongly consistent, relational database service
    • Scalable and HA
    • Missing RDBMS features
    • New SQL syntax

    bit.ly/distributed-sql-deconstructed
  19. Designed for cloud native microservices.

    Yugabyte Query Layer
    • YSQL
    • YCQL

    DocDB Distributed Document Store
    • Sharding & Load Balancing
    • Raft Consensus Replication
    • Distributed Transaction Manager & MVCC
    • Document Storage Layer
    • Custom RocksDB Storage Engine

                        PostgreSQL             Google Spanner          YugabyteDB
    SQL Ecosystem       ✓ Massively adopted    ✘ New SQL flavor        ✓ Reuse PostgreSQL
    RDBMS Features      ✓ Advanced Complex     ✘ Basic cloud-native    ✓ Advanced Complex and cloud-native
    Highly Available    ✘                      ✓                       ✓
    Horizontal Scale    ✘                      ✓                       ✓
    Distributed Txns    ✘                      ✓                       ✓
    Data Replication    Async                  Sync                    Sync + Async
  20. Design Goal: support all RDBMS features

    What’s supported today (Yugabyte v2.7)

    Impossible without reusing PostgreSQL code! Amazon Aurora uses this strategy. Other distributed SQL databases do not support most of these features.

    Building these features from the ground up:
    • Makes it hard to build a robust functional spec
    • Takes a lot of time to implement
    • Takes even longer for users to adopt and mature
  21. Layered Architecture

    Clients
    • Microservice requiring relational integrity
    • Microservice requiring massive scale
    • Microservice requiring geo-distribution of data

    Extensible Query Layer
    Extensible query layer to support multiple APIs
    • YSQL: A fully PostgreSQL compatible relational API
    • YCQL: Cassandra compatible semi-relational API

    DocDB Storage Layer
    Distributed, transactional document store with sync and async replication support

    Summary
    • Extensible query layer
      ◦ YSQL: PostgreSQL-based
      ◦ YCQL: Cassandra-based
    • Transactional storage layer
      ◦ Transactional ACID compliant
      ◦ Resilient and scalable
      ◦ Document storage
  22. Resilient and strongly consistent across failure domains

    1. Single Region, Multi-Zone (Availability Zone 1, Availability Zone 2, Availability Zone 3)
       Consistent Across Zones
       No WAN Latency But No Region-Level Failover/Repair
    2. Single Cloud, Multi-Region (Region 1, Region 2, Region 3)
       Consistent Across Regions with Auto Region-Level Failover/Repair
    3. Multi-Cloud, Multi-Region (Cloud 1, Cloud 2, Cloud 3)
       Consistent Across Clouds with Auto Cloud-Level Failover/Repair
  23. YugabyteDB Deployed as StatefulSets

    (Diagram: a Kubernetes cluster with nodes node1 to node4.)
    • yb-master StatefulSet: yugabytedb pods yb-master-0, yb-master-1, yb-master-2
    • yb-tserver StatefulSet: yugabytedb pods yb-tserver-0, yb-tserver-1, yb-tserver-2, yb-tserver-3, …, each shown with tablet 1’
    • Local/Remote Persistent Volumes
    • yb-masters Headless Service
    • yb-tservers Headless Service
    • App Clients, Admin Clients
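Pods in a StatefulSet that is governed by a headless Service get stable DNS names of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`, which is what lets database nodes find each other across restarts. A minimal sketch of that naming rule follows; the StatefulSet and Service names come from the slide, while the namespace `yb-demo` and the default cluster domain `cluster.local` are assumptions for illustration.

```python
def stable_pod_dns(statefulset, ordinal, service, namespace, cluster_domain="cluster.local"):
    """Return the stable DNS name of one StatefulSet pod behind a headless Service."""
    pod_name = f"{statefulset}-{ordinal}"  # StatefulSet pods are named <set>-<ordinal>
    return f"{pod_name}.{service}.{namespace}.svc.{cluster_domain}"


# Names of the first three yb-master pods in a hypothetical "yb-demo" namespace.
masters = [stable_pod_dns("yb-master", i, "yb-masters", "yb-demo") for i in range(3)]
for name in masters:
    print(name)
```

Because the name depends only on the ordinal and not on the node, a rescheduled pod keeps the same address.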
  24. Under the Hood – 3 Node Cluster

    • DocDB Storage Engine: Purpose-built for ever-growing data, extended from RocksDB
    • YB-Master (yb-master1, yb-master2, yb-master3): Manage shard metadata & coordinate cluster-wide ops
    • Global Transaction Manager: Tracks ACID txns across multi-row ops, incl. clock skew mgmt.
    • Raft Consensus Replication: Highly resilient, used for both data replication & leader election
    • YB-TServer (yb-tserver1, yb-tserver2, yb-tserver3): Stores/serves data in/from tablets (shards)

    (Diagram: Worker node1, node2 and node3. tablet1, tablet2 and tablet3 each have one leader and two followers spread across the three yb-tservers.)

    YB Helm Charts at charts.yugabyte.com
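The "use a distributed DB" mitigation rests on majority quorums: a Raft group with N replicas needs floor(N/2) + 1 of them to agree, so it keeps working after floor((N - 1)/2) replicas are lost. A short worked sketch of that arithmetic (the function names are illustrative, not YugabyteDB APIs):

```python
def majority(replicas):
    """Smallest number of replicas that forms a Raft quorum."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can fail while a quorum is still reachable."""
    return (replicas - 1) // 2


for n in (3, 5, 7):
    print(f"replicas={n} quorum={majority(n)} tolerated_failures={tolerated_failures(n)}")
```

With the three replicas per tablet shown in the diagram, a quorum is two, so losing a single pod or node leaves every tablet writable.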
  25. YugabyteDB on K8s Demo

    Single YB Universe Deployed on 3 separate GKE Clusters
  26. YugabyteDB Universe on 3 GKE Clusters

    Deployment: 3 GKE clusters
    • Each with 3 x N1 Standard 8 nodes
    • 3 pods in each cluster using 4 cores

    Cores: 4 cores per pod
    Memory: 7.5 GB per pod
    Disk: ~ 500 GB total for universe
  27. Cloud Native - cncf.io definition

    “Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”
  28. Introducing DBRE model

    • Database Reliability Engineering
      ◦ Inspired by Google’s SRE model
      ◦ Blending DevOps culture with DBA teams
      ◦ Infrastructure as code
      ◦ Automation is the key
  29. DBRE Guiding Principles

    • Responsibility for the data is shared by cross-functional teams
    • Provide patterns and knowledge to support other teams’ processes and facilitate their work
    • Define reference architectures and configurations for data stores that are approved for operations and can be deployed by teams
  30. YugabyteDB on K8s Multi-Region Requirements

    • Pod to pod communication over TCP ports using RPC calls across n K8s clusters
    • Global DNS Resolution system
      ◦ Across all the K8s clusters so that pods in one cluster can connect to pods in other clusters
    • Ability to create load balancers in each region/DB
    • RBAC: ClusterRole and ClusterRoleBinding
    • Reference: Deploy YugabyteDB on multi cluster GKE
      https://docs.yugabyte.com/latest/deploy/kubernetes/multi-cluster/gke/helm-chart/
  31. Ensuring High Performance

    LOCAL STORAGE
    • Lower latency, Higher throughput
    • Recommended for workloads that do their own replication
    • Pre-provision outside of K8s
    • Use SSDs for latency-sensitive apps

    REMOTE STORAGE
    • Higher latency, Lower throughput
    • Recommended for workloads that do not perform any replication on their own
    • Provision dynamically in K8s
    • Use alongside local storage for cost-efficient tiering

    Most used
  32. Configuring Data Resilience

    POD ANTI-AFFINITY
    • Pods of the same type should not be scheduled on the same node
    • Keeps impact of node failures to absolute minimum

    MULTI-ZONE/REGIONAL/MULTI-REGION POD SCHEDULING
    • Multi-Zone – Tolerate zone failures for K8s worker nodes
    • Regional – Tolerate zone failures for both K8s worker and master nodes
    • Multi-Region / Multi-Cluster – Requires network discovery between multiple clusters
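The anti-affinity rule above ("pods of the same type should not be scheduled on the same node") maps to the `podAntiAffinity` field of a pod spec. As a minimal sketch, the stanza can be built as a dict and emitted as JSON for inclusion in a manifest. The label key `app` and value `yb-tserver` are assumptions about how the pods are labelled, not taken from the talk or the Helm chart.

```python
import json


def required_anti_affinity(label_key, label_value, topology_key="kubernetes.io/hostname"):
    """Build an affinity stanza that forbids two matching pods in one topology domain."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": label_key, "operator": "In", "values": [label_value]}
                        ]
                    },
                    # kubernetes.io/hostname spreads per node; a zone label such as
                    # topology.kubernetes.io/zone would spread per zone instead.
                    "topologyKey": topology_key,
                }
            ]
        }
    }


# Hypothetical label for the tserver pods.
affinity = required_anti_affinity("app", "yb-tserver")
print(json.dumps(affinity, indent=2))
```

Swapping the topology key from the node hostname to a zone label is one way to express the multi-zone scheduling described in the second column.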
  33. Automating Day 2 Operations

    BACKUP & RESTORE
    • Backups and restores are a database level construct
    • YugabyteDB can perform distributed snapshot and copy to a target for a backup
    • Restore the backup into an existing cluster or a new cluster with a different number of TServers

    ROLLING UPGRADES
    • StatefulSets support two update strategies (`updateStrategy`): `OnDelete` (the legacy default; `RollingUpdate` is the default in `apps/v1`) and `RollingUpdate`
    • Pick rolling upgrade strategy for DBs that support zero downtime upgrades such as YugabyteDB
    • New instance of the pod spawned with same network id and storage

    HANDLING FAILURES
    • Pod failure handled by K8s automatically
    • Node failure has to be handled manually by adding a new worker node to K8s cluster
    • Local storage failure has to be handled manually by mounting new local volume to K8s
  34. Auto-scaling with k8s operators

    https://github.com/yugabyte/yugabyte-platform-operator

    • Based on Custom Controllers that have direct access to lower level K8S API
    • Excellent fit for stateful apps requiring human operational knowledge to correctly scale, reconfigure and upgrade while simultaneously ensuring high performance and data resilience
    • Complementary to Helm for packaging

    (Diagram: CPU usage in the yb-tserver StatefulSet. Scale pods when CPU > 80% for 1min and max_threshold not exceeded.)
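The scaling rule on this slide (CPU above 80% for one minute, and `max_threshold` not exceeded) can be sketched as a small decision function. This is an illustration of the rule only, not the operator's code: it assumes the caller passes the CPU samples collected over the last minute, and it interprets `max_threshold` as a cap on the number of pods.

```python
def should_scale_up(cpu_samples, current_pods, max_pods, threshold=80.0):
    """Return True when CPU stayed above the threshold for the whole window
    and adding a pod would not exceed the configured maximum."""
    if not cpu_samples:
        return False  # no data, no decision
    sustained = all(sample > threshold for sample in cpu_samples)
    return sustained and current_pods < max_pods


# Hypothetical one-minute windows of CPU utilisation (percent).
print(should_scale_up([85.0, 91.5, 88.2], current_pods=3, max_pods=6))
print(should_scale_up([85.0, 72.0, 88.2], current_pods=3, max_pods=6))
```

Requiring every sample in the window to exceed the threshold avoids reacting to a brief spike, which matters for a database where adding a node triggers data rebalancing.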
  35. Performance parity across Kubernetes and VMs - TPCC workloads

    TPCC Workload   VMs (AWS)                               Kubernetes (GKE)
    Topology        3 x c5.4xlarge (16 vCPUs,               3 x TServer Pods (16 vCPUs,
                    32 GiB RAM, 400 GB SSD)                 15 GB RAM, 400 GB SSD)
    tpmC            12,597.63                               12,299.60
    Efficiency      97.96%                                  95.64%
    Throughput      469.06 requests/sec                     462.32 requests/sec

    Latency (Avg / P99)   VMs (AWS)                   Kubernetes (GKE)
    New Order             33.313 ms / 115.446 ms      59.66 ms / 478.89 ms
    Payment               24.735 ms / 86.051 ms       33.53 ms / 248.07 ms
    OrderStatus           14.357 ms / 43.475 ms       14.65 ms / 100.48 ms
    Delivery              66.522 ms / 205.065 ms      148.36 ms / 838.42 ms
    StockLevel            212.180 ms / 670.487 ms     99.99 ms / 315.38 ms
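The efficiency figures can be reproduced from the tpmC values. TPC-C caps throughput at 12.86 tpmC per warehouse, so efficiency is the measured tpmC divided by 12.86 times the warehouse count. The slide does not state the warehouse count; the sketch below assumes 1,000 warehouses, which is an inference from the fact that it reproduces both reported percentages.

```python
MAX_TPMC_PER_WAREHOUSE = 12.86  # theoretical ceiling defined by the TPC-C workload


def tpcc_efficiency(tpmc, warehouses):
    """Efficiency (%) = measured tpmC / (12.86 * warehouses) * 100."""
    return tpmc / (MAX_TPMC_PER_WAREHOUSE * warehouses) * 100


WAREHOUSES = 1000  # assumption: not stated on the slide

vm_eff = round(tpcc_efficiency(12597.63, WAREHOUSES), 2)
k8s_eff = round(tpcc_efficiency(12299.60, WAREHOUSES), 2)
print(f"VMs (AWS): {vm_eff}%  Kubernetes (GKE): {k8s_eff}%")
```

Under that assumption the computed values match the 97.96% and 95.64% shown in the table.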
  36. Target Use Cases

    Systems of Record and Engagement
    Resilient, business critical data
    • Identity management
    • User/account profile
    • eCommerce apps - checkout, shopping cart
    • Real time payment systems

    Event Data and IoT
    Handling massive scale
    • Vehicle telemetry
    • Stock bids and asks
    • Shipment information
    • Credit card transactions

    Geo-Distributed Workloads
    Sync, async, and geo replication
    • Vehicle telemetry
    • Stock bids and asks
    • Shipment information
    • Credit card transactions
  37. Modern Cloud Native Application Stack

    Deployed on
    https://github.com/yugabyte/yugastore-java
  38. Microservices Architecture

    (Diagram: CART)

  39. Istio Traffic Management for Microservices

    (Diagram: UI app and API Gateway in front of the CART, PRODUCT and CHECKOUT microservices, with the Istio Edge Gateway, Istio Edge Proxy, Istio Service Discovery, Istio Route Configuration using Envoy Proxy, and the Istio Control Plane components Galley, Citadel and Pilot.)
  40. The fastest growing Distributed SQL Database

    • Slack users ▲ 3K
    • Clusters deployed ▲ 600K

    We 💛 stars! Give us one: github.com/YugaByte/yugabyte-db
    Join our community: yugabyte.com/slack
  41. Thank You

    Join us on Slack: yugabyte.com/slack
    Star us on GitHub: github.com/yugabyte/yugabyte-db