
Considerations for running Distributed SQL Databases on Kubernetes

Kubernetes has hit a home run for stateless workloads, but can it do the same for stateful services such as distributed databases? Before we can answer that question, we need to understand the challenges of running stateful workloads on, well, anything. In this talk, we will first look at which stateful workloads, specifically databases, are ideal for running inside Kubernetes. Secondly, we will explore the various concerns around running databases in Kubernetes for production environments, such as:

- The production-readiness of Kubernetes for stateful workloads in general
- The pros and cons of the various deployment architectures
- The failure characteristics of a distributed database inside containers

In this session, we will demonstrate what Kubernetes brings to the table for stateful workloads and what database servers must provide to fit the Kubernetes model. We will also highlight some of the modern databases that take full advantage of Kubernetes and offer a peek into what's possible if stateful services can meet Kubernetes halfway. Finally, we will go into the details of deployment choices, how the various cloud-vendor managed container offerings differ, and compare the performance and failure characteristics of a Kubernetes-based deployment with an equivalent VM-based deployment.

- How different kinds of databases work on Kubernetes
- The production-readiness of Kubernetes for stateful workloads in general
- The pros and cons of the various deployment architectures
- The failure characteristics of a distributed database inside containers

AMEY BANARSE

August 03, 2021

Transcript

1. YugabyteDB – Distributed SQL Database on Kubernetes. Amey Banarse, VP of Data Engineering, Yugabyte, Inc. Aug 3rd, 2021
2. Introductions: Amey Banarse, VP of Data Engineering, Yugabyte. Pivotal • FINRA • NYSE. University of Pennsylvania (UPenn). @ameybanarse • about.me/amey
3. Kubernetes Is Massively Popular in Fortune 500s
• Walmart – Edge Computing, KubeCon 2019: https://www.youtube.com/watch?v=sfPFrvDvdlk
• Target – Data @ Edge: https://tech.target.com/2018/08/08/running-cassandra-in-kubernetes-across-1800-stores.html
• eBay – Platform Modernization: https://www.ebayinc.com/stories/news/ebay-builds-own-servers-intends-to-open-source/
4. The State of Kubernetes 2020
• Substantial growth in large enterprises
  ◦ Being used in production environments
• On-premises deployments still most common
• There are pain points, but most developers and executives feel the k8s investment is worth it
5. Data on K8s Ecosystem Is Evolving Rapidly
6. Better resource utilization
• Reduce cost with better packing of DBs
• Useful when running a large number of DBs
  ◦ Multi-tenant applications with a DB per tenant
  ◦ Self-service private DBaaS
• But watch out for noisy neighbors (see the quota sketch below)
  ◦ Perf issues when running critical production workloads
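One common way to keep a tenant's DB from becoming a noisy neighbor is to give each tenant its own namespace and cap it with a ResourceQuota. A minimal sketch, assuming a hypothetical tenant-a namespace with limit values chosen purely for illustration:

```yaml
# Illustrative only: per-tenant namespace quota to contain noisy neighbors.
# The namespace name and limit values are assumptions, not from the talk.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a          # hypothetical per-tenant namespace
spec:
  hard:
    requests.cpu: "4"          # total CPU the tenant's DB pods may request
    requests.memory: 16Gi
    limits.cpu: "8"
    limits.memory: 32Gi
    persistentvolumeclaims: "6"
```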
7. Resize pod resources dynamically
• Dynamically change CPU, memory
• Embrace automation - done without incurring downtime
  ◦ Scale DB with workload
  ◦ Automate to scale up automatically
$ kubectl apply -f cpu-request-limit.yaml
$ kubectl apply -f memory-request-limit.yaml
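The slide applies manifests named cpu-request-limit.yaml and memory-request-limit.yaml; their contents are not shown, so the following is only a sketch of what such a manifest typically looks like (pod name, image tag, and values are assumptions):

```yaml
# Hypothetical contents of cpu-request-limit.yaml (not from the talk):
# a pod spec declaring CPU requests and limits for the DB container.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
spec:
  containers:
    - name: db
      image: yugabytedb/yugabyte:latest   # image tag assumed for illustration
      resources:
        requests:
          cpu: "2"       # scheduler guarantees 2 cores
        limits:
          cpu: "4"       # container is throttled above 4 cores
```

With a StatefulSet, changing requests or limits triggers a rolling replacement of the pods; a replicated database such as YugabyteDB stays available while pods are restarted one at a time, which is what makes "without incurring downtime" possible.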
8. Portability between clouds and on-premises
• Infrastructure as code
• Works in a similar fashion on any cloud
  ◦ Cloud-provider managed k8s (AKS, EKS, GKE)
  ◦ Self-managed k8s (public/private cloud)
• But not perfectly portable
  ◦ Need to understand some cloud-specific constructs (example: volume types, load balancers; see the StorageClass sketch below)
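Volume types are a good example of a construct that changes per cloud: the StorageClass name referenced by your manifests can stay the same, but the provisioner and parameters behind it differ. An illustrative pair of definitions (the provisioners are the standard CSI drivers; the class name and parameter choices are assumptions):

```yaml
# GKE: SSD persistent disks behind a StorageClass named "db-storage".
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-storage
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
---
# EKS: the same logical class name, backed by gp3 EBS volumes instead.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
```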
9. Out-of-box infrastructure orchestration
• Pods that fail are automatically restarted
• Pods are rescheduled across nodes in the cluster
  ◦ Optimal resource utilization
  ◦ Specify policies in code (example: anti-affinity; see the sketch below)
• Loss of some flexibility
  ◦ Cannot make permanent changes to pods
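An anti-affinity policy "in code" looks like the following fragment of a pod template; a minimal sketch assuming the DB pods carry an app: yb-tserver label (the label and topology key are chosen for illustration):

```yaml
# Illustrative pod-template fragment (goes under spec.template.spec):
# keep DB pods of the same type off the same worker node so one node
# failure takes out at most one replica.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: yb-tserver            # label assumed for illustration
        topologyKey: kubernetes.io/hostname
```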
10. Automating day 2 operations
• Robust automation with CRDs (Custom Resource Definitions), commonly referred to as a "K8s Operator"
• Easy to build an operator for ops
  ◦ Periodic backups
  ◦ DB software upgrades
• Automating failover of a traditional RDBMS can be dangerous
  ◦ Potential for data loss?
  ◦ Mitigation: use a distributed DB
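The point of an operator is that routine ops become a declarative resource the controller reconciles. A purely hypothetical custom resource to show the shape of the idea (the YBCluster kind, API group, and fields below are invented for illustration and are not the actual Yugabyte operator API):

```yaml
# Hypothetical custom resource: declare the desired cluster and let the
# operator handle backups and upgrades. Kind, group, and fields are invented.
apiVersion: example.com/v1alpha1
kind: YBCluster
metadata:
  name: demo-cluster
spec:
  version: "2.7.2.0"           # operator performs the rolling upgrade to this version
  tserverReplicas: 3
  backup:
    schedule: "0 2 * * *"      # nightly backup, cron syntax
    target: s3://backups/demo  # destination assumed for illustration
```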
11. Greater chance of pod failures
• Pods fail more often than VMs or bare metal
• Many reasons for the increased failure rate
  ◦ Process failures - config issues or bugs
  ◦ Out of memory and the OOM killer
  ◦ Transparent rescheduling of pods
• Will pod failures cause disruption of the service or data loss?
  ◦ Mitigation: use a distributed DB
(Diagram: data loss is likely if local storage is used by the pod)
12. Local vs persistent storage
• Local storage = use local disk on the node
  ◦ Not replicated, but higher performance
  ◦ Data not present in the new pod location
• Persistent storage = use replicated storage
  ◦ Data visible to the pod after it moves to a new node
  ◦ What to do for on-prem? Use a software solution (additional complexity)
• Mitigation: use a distributed DB
(Diagram: with local storage, a pod sees a new, empty disk after it moves to another node)
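The local-vs-persistent choice surfaces in the manifest mainly as the storageClassName on the claim. A minimal sketch (claim and class names are assumptions; a class like "local-ssd" would typically be backed by the local-volume provisioner, while "db-storage" would map to a cloud disk):

```yaml
# Illustrative PVC: swap storageClassName to move between node-local disks
# and network-attached storage that can follow the pod to a new node.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-yb-tserver-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: db-storage   # e.g. "local-ssd" for node-local disks
  resources:
    requests:
      storage: 100Gi
```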
13. Need for a load balancer
• Restricted cluster ingress in k8s
  ◦ If the app is not on the same k8s cluster, it needs an LB
• Need a load balancer to expose the DB externally
  ◦ Not an issue on public clouds - use cloud-provider network LBs
  ◦ But there may be per-cloud limits on NLBs and public IP addresses
• Bigger problem on-prem with hardware-based load balancers (example: F5)
(Diagram: load balancer to access any DB service)
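Exposing the database outside the cluster is typically a Service of type LoadBalancer in front of the SQL port. A sketch assuming the standard YSQL port 5433 and an app: yb-tserver selector label (the label is an assumption):

```yaml
# Illustrative Service: asks the cloud provider for a network load balancer
# in front of the YSQL endpoints served by the tserver pods.
apiVersion: v1
kind: Service
metadata:
  name: yb-tserver-ysql-lb
spec:
  type: LoadBalancer
  selector:
    app: yb-tserver              # label assumed for illustration
  ports:
    - name: ysql
      port: 5433                 # YugabyteDB's PostgreSQL-compatible port
      targetPort: 5433
```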
14. Networking complexities
• Two k8s clusters cannot "see" each other
• Network discovery and reachability issues
  ◦ By default, pods in one k8s cluster cannot reach or replicate to pods in another k8s cluster
• Mitigation #1: use DNS chaining today (operational complexity, depends on the environment; see the sketch below)
• Mitigation #2: use a service mesh like Istio (but lower performance - HTTP layer vs TCP)
Video: KubeCon EU 2021 – Building the Multi-Cluster Data Layer
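"DNS chaining" means teaching each cluster's DNS to resolve the other clusters' service domains. On clusters running kube-dns (for example GKE), one way to express this is a stub-domain entry pointing at the other cluster's exposed DNS endpoint; a sketch with an invented domain suffix and IP:

```yaml
# Illustrative kube-dns ConfigMap: forward lookups for the other cluster's
# domain to that cluster's DNS endpoint. Domain and IP are invented.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"cluster-b.local": ["10.20.0.10"]}
```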
15. Running a Distributed SQL DB in k8s (YugabyteDB):
• Better resource utilization
• Resize pod resources dynamically
• Portability (cloud and on-premises)
• Out-of-box infrastructure orchestration
• Automate day 2 DB operations
• Greater chance of pod failures
• Local storage vs persistent storage
• Need for a load balancer
• Networking complexities
• Operational maturity curve
16. Transactional, distributed SQL database designed for resilience and scale. 100% open source, PostgreSQL-compatible, enterprise-grade RDBMS… built to run across all your cloud environments.
17. A Brief History of Yugabyte
Part of Facebook's cloud native DB evolution:
• The Yugabyte team dealt with this growth first hand
• Massive geo-distributed deployment given global users
• Worked with a world-class infra team to solve these issues
Builders of multiple popular databases: 1+ trillion ops/day, 100+ petabyte data set sizes. The Yugabyte founding team ran Facebook's public-cloud-scale DBaaS.
18. Designing the perfect Distributed SQL Database
• Amazon Aurora: a highly available MySQL and PostgreSQL-compatible relational database service. Not scalable, but HA. All RDBMS features (PostgreSQL & MySQL).
• Google Spanner: the first horizontally scalable, strongly consistent relational database service. Scalable and HA. Missing RDBMS features; new SQL syntax.
Aurora is much more popular than Spanner, and adoption of PostgreSQL for cloud-native applications is skyrocketing. bit.ly/distributed-sql-deconstructed
19. Designed for cloud native microservices.
Yugabyte Query Layer: YSQL, YCQL. DocDB Distributed Document Store: Sharding & Load Balancing, Raft Consensus Replication, Distributed Transaction Manager & MVCC, Document Storage Layer, Custom RocksDB Storage Engine.
PostgreSQL vs Google Spanner vs YugabyteDB:
• SQL Ecosystem: ✓ Massively adopted / ✘ New SQL flavor / ✓ Reuses PostgreSQL
• RDBMS Features: ✓ Advanced, complex / ✘ Basic, cloud-native / ✓ Advanced, complex and cloud-native
• Highly Available: ✘ / ✓ / ✓
• Horizontal Scale: ✘ / ✓ / ✓
• Distributed Txns: ✘ / ✓ / ✓
• Data Replication: Async / Sync / Sync + Async
20. Design Goal: support all RDBMS features
What's supported today (Yugabyte v2.7) is impossible without reusing PostgreSQL code! Amazon Aurora uses this strategy. Other distributed SQL databases do not support most of these features. Building these features from the ground up is:
• Hard - building a robust functional spec is difficult
• Takes a lot of time to implement
• Takes even longer for users to adopt and mature
21. Layered Architecture
Extensible query layer to support multiple APIs:
• YSQL: a fully PostgreSQL-compatible relational API
• YCQL: a Cassandra-compatible semi-relational API
DocDB storage layer: a distributed, transactional document store with sync and async replication support
• Transactional, ACID-compliant
• Resilient and scalable
• Document storage
Serves microservices requiring relational integrity, massive scale, or geo-distribution of data.
22. Resilient and strongly consistent across failure domains
1. Single Region, Multi-Zone (Availability Zones 1-3): consistent across zones, no WAN latency, but no region-level failover/repair
2. Single Cloud, Multi-Region (Regions 1-3): consistent across regions with auto region-level failover/repair
3. Multi-Cloud, Multi-Region (Clouds 1-3): consistent across clouds with auto cloud-level failover/repair
23. YugabyteDB Deployed as StatefulSets
Two StatefulSets spread across the worker nodes:
• yb-master StatefulSet: yb-master-0, yb-master-1, yb-master-2 pods
• yb-tserver StatefulSet: yb-tserver-0 through yb-tserver-3 pods, each holding tablets
Each pod is backed by a local or remote persistent volume. Two headless services, yb-masters and yb-tservers, provide discovery; app clients and admin clients connect through these services.
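A stripped-down sketch of that shape for the tserver half (the real Helm chart sets many more flags, ports, and probes; the container command is omitted and all values are illustrative only):

```yaml
# Illustrative skeleton: a headless Service for stable per-pod DNS names
# plus a StatefulSet with per-pod persistent volumes. Not the chart output.
apiVersion: v1
kind: Service
metadata:
  name: yb-tservers
spec:
  clusterIP: None              # headless: each pod gets a stable DNS name
  selector:
    app: yb-tserver
  ports:
    - name: ysql
      port: 5433
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-tserver
spec:
  serviceName: yb-tservers
  replicas: 3
  selector:
    matchLabels:
      app: yb-tserver
  template:
    metadata:
      labels:
        app: yb-tserver
    spec:
      containers:
        - name: yb-tserver
          image: yugabytedb/yugabyte:latest   # tag assumed; command/flags omitted
          volumeMounts:
            - name: datadir
              mountPath: /mnt/disk0
  volumeClaimTemplates:
    - metadata:
        name: datadir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```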
24. Under the Hood – 3 Node Cluster
• YB-Master (yb-master1-3): manages shard metadata & coordinates cluster-wide ops
• YB-TServer (yb-tserver1-3): stores/serves data in/from tablets (shards); each tablet has one leader and followers on other nodes
• DocDB storage engine: purpose-built for ever-growing data, extended from RocksDB
• Global transaction manager: tracks ACID txns across multi-row ops, incl. clock skew mgmt.
• Raft consensus replication: highly resilient, used for both data replication & leader election
YB Helm Charts at charts.yugabyte.com
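The Helm charts at charts.yugabyte.com are the usual way to stand such a cluster up. A hedged sketch of a values override is below; the exact key names depend on the chart version, so treat them as placeholders and confirm against the chart's values.yaml:

```yaml
# Illustrative values override for the YugabyteDB Helm chart.
# Typical usage (commands shown as comments):
#   helm repo add yugabytedb https://charts.yugabyte.com
#   helm install yb-demo yugabytedb/yugabyte -f values-demo.yaml
replicas:
  master: 3
  tserver: 3
storage:
  tserver:
    size: 100Gi
    storageClass: db-storage   # class name assumed for illustration
resource:
  tserver:
    requests:
      cpu: 4
      memory: 7.5Gi
```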
25. YugabyteDB on K8s Demo: a single YB universe deployed on 3 separate GKE clusters
26. YugabyteDB Universe on 3 GKE Clusters
Deployment: 3 GKE clusters, each with 3 x n1-standard-8 nodes; 3 pods in each cluster using 4 cores.
• Cores: 4 per pod
• Memory: 7.5 GB per pod
• Disk: ~500 GB total for the universe
27. Cloud Native - cncf.io definition: "Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil."
28. Introducing the DBRE model
• Database Reliability Engineering
  ◦ Inspired by Google's SRE model
  ◦ Blending DevOps culture with DBA teams
  ◦ Infrastructure as code
  ◦ Automation is the key
29. DBRE Guiding Principles
• Responsibility for the data is shared by cross-functional teams
• Provide patterns and knowledge to support other teams' processes and facilitate their work
• Define reference architectures and configurations for data stores that are approved for operations and can be deployed by teams
30. YugabyteDB on K8s Multi-Region Requirements
• Pod-to-pod communication over TCP ports using RPC calls across n K8s clusters
• Global DNS resolution system
  ◦ Across all the K8s clusters, so that pods in one cluster can connect to pods in other clusters
• Ability to create load balancers in each region/DB
• RBAC: ClusterRole and ClusterRoleBinding
• Reference: Deploy YugabyteDB on multi-cluster GKE - https://docs.yugabyte.com/latest/deploy/kubernetes/multi-cluster/gke/helm-chart/
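The RBAC requirement usually amounts to a ClusterRole plus a ClusterRoleBinding for the service account the charts run under. The rules below are placeholders to show the shape only; the actual permissions come from the chart or operator you deploy:

```yaml
# Illustrative RBAC pair. Resource list, service account, and namespace are
# placeholders, not the actual permissions required by the YugabyteDB charts.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: yugabyte-multicluster
rules:
  - apiGroups: [""]
    resources: ["nodes", "endpoints", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: yugabyte-multicluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: yugabyte-multicluster
subjects:
  - kind: ServiceAccount
    name: yugabyte               # service account name assumed
    namespace: yb-demo           # namespace assumed
```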
31. Ensuring High Performance
LOCAL STORAGE
• Lower latency, higher throughput
• Recommended for workloads that do their own replication
• Pre-provision outside of K8s
• Use SSDs for latency-sensitive apps
REMOTE STORAGE (most used)
• Higher latency, lower throughput
• Recommended for workloads that do not perform any replication on their own
• Provision dynamically in K8s
• Use alongside local storage for cost-efficient tiering
32. Configuring Data Resilience
POD ANTI-AFFINITY
• Pods of the same type should not be scheduled on the same node
• Keeps the impact of node failures to an absolute minimum
MULTI-ZONE / REGIONAL / MULTI-REGION POD SCHEDULING (see the sketch below)
• Multi-zone - tolerate zone failures for K8s worker nodes
• Regional - tolerate zone failures for both K8s worker and master nodes
• Multi-region / multi-cluster - requires network discovery between clusters
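Zone-aware scheduling can be expressed directly in the pod template; a sketch using a topology spread constraint over the standard zone label (the selector label is an assumption):

```yaml
# Illustrative pod-template fragment (goes under spec.template.spec):
# spread DB pods evenly across availability zones so a zone outage
# affects at most a minority of replicas.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: yb-tserver            # label assumed for illustration
```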
33. Automating Day 2 Operations
BACKUP & RESTORE
• Backups and restores are a database-level construct
• YugabyteDB can perform a distributed snapshot and copy it to a target for a backup
• Restore the backup into an existing cluster or a new cluster with a different number of TServers
ROLLING UPGRADES
• Supports two upgrade strategies: onDelete (default) and rollingUpgrade
• Pick the rolling upgrade strategy for DBs that support zero-downtime upgrades, such as YugabyteDB
• A new instance of the pod is spawned with the same network ID and storage
HANDLING FAILURES
• Pod failure is handled by K8s automatically
• Node failure has to be handled manually by adding a new worker node to the K8s cluster
• Local storage failure has to be handled manually by mounting a new local volume to K8s
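In plain StatefulSet terms those two strategies correspond to the updateStrategy field; a sketch of the rolling variant (this is the generic Kubernetes field, and the Helm chart may expose it under its own value name):

```yaml
# Illustrative StatefulSet fragment (goes under the StatefulSet spec):
# roll pods one at a time, highest ordinal first, on spec changes.
# With type: OnDelete, pods are only replaced when deleted manually.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 0     # upgrade all ordinals; raise to hold back lower ordinals
```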
34. Auto-scaling with K8s operators
https://github.com/yugabyte/yugabyte-platform-operator
• Based on custom controllers that have direct access to the lower-level K8s API
• An excellent fit for stateful apps requiring human operational knowledge to correctly scale, reconfigure and upgrade, while simultaneously ensuring high performance and data resilience
• Complementary to Helm for packaging
(Diagram: scale pods when CPU usage in the yb-tserver StatefulSet stays above 80% for 1 min and max_threshold is not exceeded)
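A similar trigger can also be written as a plain HorizontalPodAutoscaler against the yb-tserver StatefulSet. This is shown only to illustrate the 80% CPU policy; the operator in the slide implements its own scaling logic and safety checks, and min/max replica counts below are assumptions:

```yaml
# Illustrative HPA: add tserver pods when average CPU crosses 80%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: yb-tserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: yb-tserver
  minReplicas: 3
  maxReplicas: 9
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```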
35. Performance parity across Kubernetes and VMs - TPC-C workloads
• Topology: VMs (AWS) 3 x c5.4xlarge (16 vCPUs, 32 GiB RAM, 400 GB SSD) vs Kubernetes (GKE) 3 x TServer pods (16 vCPUs, 15 GB RAM, 400 GB SSD)
• tpmC: 12,597.63 (VMs) vs 12,299.60 (K8s)
• Efficiency: 97.96% vs 95.64%
• Throughput: 469.06 requests/sec vs 462.32 requests/sec
• Latency (VMs): New Order avg 33.313 ms, P99 115.446 ms; Payment avg 24.735 ms, P99 86.051 ms; OrderStatus avg 14.357 ms, P99 43.475 ms; Delivery avg 66.522 ms, P99 205.065 ms; StockLevel avg 212.180 ms, P99 670.487 ms
• Latency (K8s): New Order avg 59.66 ms, P99 478.89 ms; Payment avg 33.53 ms, P99 248.07 ms; OrderStatus avg 14.65 ms, P99 100.48 ms; Delivery avg 148.36 ms, P99 838.42 ms; StockLevel avg 99.99 ms, P99 315.38 ms
36. Target Use Cases
• Systems of Record and Engagement - resilient, business-critical data: identity management, user/account profile, eCommerce apps (checkout, shopping cart), real-time payment systems
• Event Data and IoT - handling massive scale: vehicle telemetry, stock bids and asks, shipment information, credit card transactions
• Geo-Distributed Workloads - sync, async, and geo replication
37. Modern Cloud Native Application Stack, deployed on Kubernetes: https://github.com/yugabyte/yugastore-java
38. Microservices Architecture
UI app → API Gateway (REST APIs) → cart, product, and checkout microservices, backed by a Yugabyte cluster over the YSQL and YCQL APIs.
39. Istio Traffic Management for Microservices
The same UI app, API gateway, and cart/product/checkout microservices fronted by an Istio edge gateway and edge proxy, with the Istio control plane (Pilot, Galley, Citadel) providing service discovery and route configuration using the Envoy proxy.
40. The fastest growing Distributed SQL Database
• Slack users ▲ 3K - join our community: yugabyte.com/slack
• Clusters deployed ▲ 600K
• We 💛 stars! Give us one: github.com/YugaByte/yugabyte-db
41. Thank You
Join us on Slack: yugabyte.com/slack
Star us on GitHub: github.com/yugabyte/yugabyte-db