Slide 1

YugabyteDB – Distributed SQL Database on Kubernetes
Amey Banarse, VP of Data Engineering, Yugabyte, Inc.
Aug 3rd, 2021

Slide 2

Introductions

Amey Banarse
VP of Data Engineering, Yugabyte
Pivotal • FINRA • NYSE
University of Pennsylvania (UPenn)
@ameybanarse | about.me/amey

Slide 3

Kubernetes Is Massively Popular in Fortune 500s

● Walmart – Edge Computing, KubeCon 2019
  https://www.youtube.com/watch?v=sfPFrvDvdlk
● Target – Data @ Edge
  https://tech.target.com/2018/08/08/running-cassandra-in-kubernetes-across-1800-stores.html
● eBay – Platform Modernization
  https://www.ebayinc.com/stories/news/ebay-builds-own-servers-intends-to-open-source/

Slide 4

The State of Kubernetes 2020

● Substantial growth in large enterprises
  ○ Being used in production environments
● On-premises deployments still most common
● There are pain points, but most developers and executives feel the k8s investment is worth it

Slide 5

Data on K8s Ecosystem Is Evolving Rapidly

Slide 6

Why run a DB in K8s?

Slide 7

Better resource utilization

● Reduce cost with better packing of DBs
● Useful when running a large number of DBs
  ○ Multi-tenant applications with a DB per tenant
  ○ Self-service private DBaaS
● But watch out for noisy neighbors
  ○ Perf issues when running critical production workloads
[Diagram: DBs packed across Node #1, Node #2, Node #3]

Slide 8

Resize pod resources dynamically

● Dynamically change CPU and memory
● Embrace automation - done without incurring downtime
  ○ Scale the DB with the workload
  ○ Automate scale-up
$ kubectl apply -f cpu-request-limit.yaml
$ kubectl apply -f memory-request-limit.yaml
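As an illustration, a manifest like the one below is what such a `kubectl apply` would typically reference. The file name follows the slide, but the pod name, image, and sizes are made up:

```yaml
# cpu-request-limit.yaml - illustrative only; pod/container names and sizes are hypothetical
apiVersion: v1
kind: Pod
metadata:
  name: db-pod-example
spec:
  containers:
  - name: db
    image: postgres:13          # placeholder image
    resources:
      requests:
        cpu: "2"                # CPU guaranteed to the container
        memory: "4Gi"
      limits:
        cpu: "4"                # hard ceiling
        memory: "8Gi"
```

Note that on most clusters changing requests/limits still recreates the pod; a StatefulSet rolls pods one at a time, which is how a replicated database avoids downtime while being resized.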

Slide 9

Portability between clouds and on-premises

● Infrastructure as code
● Works in a similar fashion on any cloud
  ○ Cloud-provider managed k8s (AKS, EKS, GKE)
  ○ Self-managed k8s (public/private cloud)
● But not perfectly portable
  ○ Need to understand some cloud-specific constructs (example: volume types, load balancers), as sketched below
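For example, the cloud-specific volume type can be isolated in a StorageClass so that the rest of the manifests stay portable. These are illustrative; provisioner parameters vary by provider and version:

```yaml
# gke-ssd.yaml - GKE flavor of a "fast-ssd" class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
---
# eks-ssd.yaml - EKS flavor of the same "fast-ssd" class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
```

Pods and PVCs then reference `storageClassName: fast-ssd` regardless of which cloud the cluster runs on.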

Slide 10

Out-of-the-box infrastructure orchestration

● Pods that fail are automatically restarted
● Pods are scheduled across nodes in the cluster
  ○ Optimal resource utilization
  ○ Specify policies in code (example: anti-affinity, sketched below)
● Loss of some flexibility
  ○ Cannot make permanent changes to pods
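A minimal sketch of such a policy-as-code rule (the label key/value is hypothetical): pod anti-affinity that keeps two database pods of the same type off the same node.

```yaml
# Illustrative anti-affinity rule inside a pod or StatefulSet template spec
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-db           # hypothetical label on the DB pods
      topologyKey: kubernetes.io/hostname   # "same node" = same hostname
```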

Slide 11

Automating day 2 operations

● Robust automation with CRDs (Custom Resource Definitions), commonly referred to as a 'K8s Operator'
● Easy to build an operator for ops tasks (see the hypothetical sketch below)
  ○ Periodic backups
  ○ DB software upgrades
● Automating failover of a traditional RDBMS can be dangerous
  ○ Potential for data loss?
  ○ Mitigation: use a distributed DB
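As a sketch of the idea only (this is not a real CRD shipped by any particular operator), periodic backups could be exposed as a custom resource that the operator's controller reconciles:

```yaml
# Hypothetical custom resource - the kind and every field name are illustrative
apiVersion: example.com/v1alpha1
kind: DatabaseBackupSchedule
metadata:
  name: nightly-backup
spec:
  cluster: my-db-cluster      # target DB cluster (made-up name)
  schedule: "0 2 * * *"       # cron: every night at 02:00
  retention: 7                # keep the last 7 backups
  storage:
    s3Bucket: my-backup-bucket
```

The operator watches objects of this kind and runs the actual backup jobs, so the ops knowledge lives in code rather than in runbooks.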

Slide 12

Why NOT run a DB in K8s?

Slide 13

Greater chance of pod failures

● Pods fail more often than VMs or bare metal
● Many reasons for the increased failure rate
  ○ Process failures - config issues or bugs
  ○ Out of memory and the OOM killer
  ○ Transparent rescheduling of pods
● Will pod failures cause disruption of the service or data loss?
  ○ Mitigation: use a distributed DB
[Diagram: node failure across Node #1-3 - data loss likely if local storage is used by the pod]

Slide 14

Local vs persistent storage

● Local storage = use the local disk on the node
  ○ Not replicated, but higher performance
  ○ Data not present in the new pod location
● Persistent storage = use replicated storage
  ○ Data visible to the pod after it moves to a new node
  ○ What to do for on-prem? Use a software-defined storage solution (additional complexity)
● Mitigation: use a distributed DB (see the storage sketch below)
[Diagram: with local storage, the pod sees a new, empty disk (Disk 3) after moving to another node]
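To make the distinction concrete, here are illustrative manifests (node name, sizes, and class names are made up): a local PersistentVolume is pinned to one node via node affinity, while a dynamically provisioned network volume can be re-attached wherever the pod lands.

```yaml
# Local PV - tied to worker-1; if the pod moves, the data stays behind
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["worker-1"]        # hypothetical node name
---
# Network-backed PVC - the volume follows the pod to its new node
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd          # e.g. the cloud StorageClass shown earlier
  resources:
    requests:
      storage: 100Gi
```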

Slide 15

Need for a load balancer

● Restricted cluster ingress in k8s
  ○ If the app is not on the same k8s cluster, it needs an LB
● Need a load balancer to expose the DB externally (see the service sketch below)
  ○ Not an issue on public clouds - use cloud-provider network LBs
  ○ But there may be per-cloud limits on NLBs and public IP addresses
● Bigger problem on-prem with hardware-based load balancers (example: F5)
[Diagram: load balancer in front of Node #1-3 to access any DB service]
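A sketch of exposing the SQL endpoint externally on a public cloud (the service name and pod label are illustrative; the actual Yugabyte charts create their own services):

```yaml
# Illustrative LoadBalancer service in front of the YSQL port
apiVersion: v1
kind: Service
metadata:
  name: db-external
spec:
  type: LoadBalancer            # cloud provider provisions an NLB/ELB
  selector:
    app: yb-tserver             # hypothetical pod label
  ports:
  - name: ysql
    port: 5433                  # YSQL default port
    targetPort: 5433
```

On-prem, `type: LoadBalancer` only works if something like MetalLB or a hardware LB integration is in place to hand out the external address.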

Slide 16

Networking complexities

● Two k8s clusters cannot "see" each other
● Network discovery and reachability issues
  ○ Pods in one k8s cluster cannot address or replicate to pods in another k8s cluster by default
● Mitigation #1: use DNS chaining today (operational complexity, depends on the environment) - see the sketch below
● Mitigation #2: use a service mesh like Istio (but lower performance - HTTP layer vs TCP)
Video: KubeCon EU 2021 - Building the Multi-Cluster Data Layer
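The "DNS chaining" mitigation usually amounts to teaching each cluster's DNS to forward the other clusters' zones. A hedged, GKE-style kube-dns example (domain names and IPs are made up; CoreDNS-based clusters use a forward block instead):

```yaml
# Illustrative kube-dns stub-domain config in cluster 1, pointing at clusters 2 and 3
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"cluster2.local": ["10.20.0.10"], "cluster3.local": ["10.30.0.10"]}
```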

Slide 17

Running a Distributed SQL DB in k8s (YugabyteDB)

Pros:
● Better resource utilization
● Resize pod resources dynamically
● Portability (cloud and on-premises)
● Out-of-the-box infrastructure orchestration
● Automate day 2 DB operations

Cons:
● Greater chance of pod failures
● Local storage vs persistent storage
● Need for a load balancer
● Networking complexities
● Operational maturity curve

Slide 18

Transactional, distributed SQL database designed for resilience and scale

100% open source, PostgreSQL-compatible, enterprise-grade RDBMS ... built to run across all your cloud environments

Slide 19

A Brief History of Yugabyte

Part of Facebook's cloud native DB evolution
● The Yugabyte team dealt with this growth first-hand
● Massive geo-distributed deployment given global users
● Worked with a world-class infra team to solve these issues
● 1+ trillion ops/day, 100+ petabyte data set sizes

Builders of multiple popular databases - the Yugabyte founding team ran Facebook's public-cloud-scale DBaaS

Slide 20

Designing the perfect Distributed SQL Database

Amazon Aurora - a highly available MySQL- and PostgreSQL-compatible relational database service
● Not scalable, but HA
● All RDBMS features (PostgreSQL & MySQL)
● Much more popular than Spanner

Google Spanner - the first horizontally scalable, strongly consistent relational database service
● Scalable and HA
● Missing RDBMS features
● New SQL syntax

Skyrocketing adoption of PostgreSQL for cloud-native applications
bit.ly/distributed-sql-deconstructed

Slide 21

Designed for cloud native microservices

Architecture (diagram): Yugabyte Query Layer (YSQL, YCQL) on top of DocDB, a distributed document store - sharding & load balancing, Raft consensus replication, distributed transaction manager & MVCC, document storage layer, custom RocksDB storage engine

                    PostgreSQL             Google Spanner           YugabyteDB
SQL Ecosystem       ✓ Massively adopted    ✘ New SQL flavor         ✓ Reuses PostgreSQL
RDBMS Features      ✓ Advanced, complex    ✘ Basic, cloud-native    ✓ Advanced, complex and cloud-native
Highly Available    ✘                      ✓                        ✓
Horizontal Scale    ✘                      ✓                        ✓
Distributed Txns    ✘                      ✓                        ✓
Data Replication    Async                  Sync                     Sync + Async

Slide 22

Design Goal: support all RDBMS features
What's supported today (Yugabyte v2.7)

Impossible without reusing PostgreSQL code! Amazon Aurora uses this strategy. Other distributed SQL databases do not support most of these features.

Building these features from the ground up:
● Hard to build a robust functional spec
● Takes a lot of time to implement
● Takes even longer for users to adopt and mature

Slide 23

Layered Architecture

Extensible query layer - supports multiple APIs
● YSQL: a fully PostgreSQL-compatible relational API (for microservices requiring relational integrity)
● YCQL: a Cassandra-compatible semi-relational API (for microservices requiring massive scale)

DocDB storage layer - a distributed, transactional document store with sync and async replication support (for microservices requiring geo-distribution of data)
● Transactional, ACID-compliant
● Resilient and scalable
● Document storage

Slide 24

Resilient and strongly consistent across failure domains

1. Single Region, Multi-Zone (Availability Zones 1-3)
   ○ Consistent across zones
   ○ No WAN latency, but no region-level failover/repair
2. Single Cloud, Multi-Region (Regions 1-3)
   ○ Consistent across regions, with automatic region-level failover/repair
3. Multi-Cloud, Multi-Region (Clouds 1-3)
   ○ Consistent across clouds, with automatic cloud-level failover/repair

Slide 25

YugabyteDB on K8s Architecture

Slide 26

YugabyteDB Deployed as StatefulSets

● yb-master StatefulSet: yb-master-0, yb-master-1, yb-master-2 pods
● yb-tserver StatefulSet: yb-tserver-0 … yb-tserver-3 pods, each hosting tablets
● Each pod is backed by a local or remote persistent volume
● yb-masters and yb-tservers headless services front the pods
● Admin clients connect via the yb-masters service; app clients connect via the yb-tservers service
[Diagram: pods spread across node1-node4; a trimmed sketch of the underlying objects follows]
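A heavily trimmed sketch of what such a deployment looks like (this is not the actual chart output; names, image tag, and sizes are illustrative): a headless service plus a StatefulSet, so each pod gets a stable DNS name such as yb-tserver-0.yb-tservers.

```yaml
# Illustrative only - the real manifests come from the Yugabyte Helm chart
apiVersion: v1
kind: Service
metadata:
  name: yb-tservers
spec:
  clusterIP: None               # headless: per-pod DNS, no virtual IP
  selector:
    app: yb-tserver
  ports:
  - { name: ysql, port: 5433 }
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-tserver
spec:
  serviceName: yb-tservers
  replicas: 4
  selector:
    matchLabels: { app: yb-tserver }
  template:
    metadata:
      labels: { app: yb-tserver }
    spec:
      containers:
      - name: yb-tserver
        image: yugabytedb/yugabyte:latest   # placeholder tag
        ports:
        - containerPort: 5433
  volumeClaimTemplates:                     # one persistent volume per pod
  - metadata: { name: datadir }
    spec:
      accessModes: ["ReadWriteOnce"]
      resources: { requests: { storage: 100Gi } }
```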

Slide 27

Under the Hood - 3-Node Cluster

● YB-Master (yb-master1-3): manages shard metadata & coordinates cluster-wide ops
● YB-TServer (yb-tserver1-3, one per worker node): stores/serves data in/from tablets (shards); tablet leaders and followers (tablet1-3) are spread across the tservers
● Raft consensus replication: highly resilient, used for both data replication & leader election
● Global transaction manager: tracks ACID txns across multi-row ops, incl. clock skew mgmt.
● DocDB storage engine: purpose-built for ever-growing data, extended from RocksDB
● YB Helm charts at charts.yugabyte.com

Slide 28

Deployed on all popular Kubernetes platforms

Slide 29

YugabyteDB on K8s Demo

A single YB universe deployed on 3 separate GKE clusters

Slide 30

YugabyteDB Universe on 3 GKE Clusters

● Deployment: 3 GKE clusters, each with 3 x n1-standard-8 nodes
● 3 pods in each cluster, each using 4 cores
● Cores: 4 per pod
● Memory: 7.5 GB per pod
● Disk: ~500 GB total for the universe
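Assuming the demo uses the Yugabyte Helm chart, the per-pod sizing above would be expressed in a values override roughly like the sketch below. The key names are written from memory and should be checked against the chart's values.yaml:

```yaml
# demo-values.yaml - illustrative override for one of the three GKE clusters
replicas:
  master: 3
  tserver: 3
resource:
  tserver:
    requests: { cpu: "4", memory: "7.5Gi" }
    limits:   { cpu: "4", memory: "7.5Gi" }
storage:
  tserver:
    count: 1
    size: 50Gi          # ~500 GB across the whole universe
    storageClass: standard
```

Applied with something like `helm install yb-demo yugabytedb/yugabyte -f demo-values.yaml` in each cluster (repo and release names assumed here).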

Slide 31

[Demo screenshot: yb-tserver1, yb-tserver2, yb-tserver3]

Slide 32

Cloud Native - cncf.io definition

"Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil."

Slide 33

Introducing the DBRE Model

● Database Reliability Engineering
  ○ Inspired by Google's SRE model
  ○ Blends DevOps culture with DBA teams
  ○ Infrastructure as code
  ○ Automation is the key

Slide 34

DBRE Guiding Principles

● Responsibility for the data is shared by cross-functional teams
● Provide patterns and knowledge to support other teams' processes and facilitate their work
● Define reference architectures and configurations for data stores that are approved for operations and can be deployed by teams

Slide 35

YugabyteDB on K8s Multi-Region Requirements

● Pod-to-pod communication over TCP ports using RPC calls across the n K8s clusters
● Global DNS resolution system
  ○ Across all the K8s clusters, so that pods in one cluster can connect to pods in other clusters
● Ability to create load balancers in each region/DB
● RBAC: ClusterRole and ClusterRoleBinding
● Reference: Deploy YugabyteDB on multi-cluster GKE
  https://docs.yugabyte.com/latest/deploy/kubernetes/multi-cluster/gke/helm-chart/

Slide 36

Ensuring High Performance

LOCAL STORAGE
● Lower latency, higher throughput
● Recommended for workloads that do their own replication
● Pre-provision outside of K8s
● Use SSDs for latency-sensitive apps

REMOTE STORAGE (most used)
● Higher latency, lower throughput
● Recommended for workloads that do not perform any replication on their own
● Provision dynamically in K8s
● Use alongside local storage for cost-efficient tiering

Slide 37

Configuring Data Resilience

POD ANTI-AFFINITY
● Pods of the same type should not be scheduled on the same node
● Keeps the impact of node failures to an absolute minimum

MULTI-ZONE / REGIONAL / MULTI-REGION POD SCHEDULING (zone-spread sketch below)
● Multi-Zone - tolerate zone failures for K8s worker nodes
● Regional - tolerate zone failures for both K8s worker and master nodes
● Multi-Region / Multi-Cluster - requires network discovery between clusters
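Zone-level spreading is the same anti-affinity idea with a different topology key. A hedged sketch (label values hypothetical):

```yaml
# Prefer to spread DB pods across zones as well as across nodes - illustrative
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: yb-tserver
        topologyKey: topology.kubernetes.io/zone   # "same zone" instead of "same node"
```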

Slide 38

Automating Day 2 Operations

BACKUP & RESTORE
● Backups and restores are a database-level construct
● YugabyteDB can perform a distributed snapshot and copy it to a target for a backup
● Restore the backup into an existing cluster or a new cluster with a different number of TServers

ROLLING UPGRADES
● Supports two update strategies: OnDelete (default) and RollingUpdate (fragment below)
● Pick the rolling-upgrade strategy for DBs that support zero-downtime upgrades, such as YugabyteDB
● A new instance of the pod is spawned with the same network identity and storage

HANDLING FAILURES
● Pod failures are handled by K8s automatically
● Node failure has to be handled manually by adding a new worker node to the K8s cluster
● Local storage failure has to be handled manually by mounting a new local volume to K8s
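In StatefulSet terms the two strategies look like this illustrative fragment:

```yaml
# StatefulSet update strategy - RollingUpdate replaces pods one at a time;
# OnDelete only replaces a pod when it is manually deleted
spec:
  updateStrategy:
    type: RollingUpdate        # or: OnDelete
    rollingUpdate:
      partition: 0             # roll every pod; raise to canary only high ordinals
```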

Slide 39

Auto-Scaling with K8s Operators

● https://github.com/yugabyte/yugabyte-platform-operator
● Based on custom controllers that have direct access to the lower-level K8s API
● An excellent fit for stateful apps requiring human operational knowledge to correctly scale, reconfigure and upgrade while simultaneously ensuring high performance and data resilience
● Complementary to Helm for packaging
[Diagram: CPU usage in the yb-tserver StatefulSet → scale pods when CPU > 80% for 1 min and max_threshold is not exceeded]
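The operator's scaling loop is custom, but the CPU-threshold rule in the diagram can be approximated with a plain HorizontalPodAutoscaler for intuition. This is a stand-in, not the yugabyte-platform-operator's actual API:

```yaml
# Illustrative HPA approximating "scale when CPU > 80%"
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: yb-tserver-autoscale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: yb-tserver
  minReplicas: 3
  maxReplicas: 9               # the "max_threshold" guardrail
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```

Scaling a database also requires rebalancing data onto the new pods, which is why an operator with domain knowledge, rather than a bare HPA, is the better fit.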

Slide 40

Performance parity across Kubernetes and VMs - TPCC workloads

VMs (AWS): 3 x c5.4xlarge (16 vCPUs, 32 GiB RAM, 400 GB SSD)
● tpmC: 12,597.63 | Efficiency: 97.96% | Throughput: 469.06 requests/sec
● Latency - New Order: avg 33.313 ms, P99 115.446 ms; Payment: avg 24.735 ms, P99 86.051 ms; OrderStatus: avg 14.357 ms, P99 43.475 ms; Delivery: avg 66.522 ms, P99 205.065 ms; StockLevel: avg 212.180 ms, P99 670.487 ms

Kubernetes (GKE): 3 x TServer pods (16 vCPUs, 15 GB RAM, 400 GB SSD)
● tpmC: 12,299.60 | Efficiency: 95.64% | Throughput: 462.32 requests/sec
● Latency - New Order: avg 59.66 ms, P99 478.89 ms; Payment: avg 33.53 ms, P99 248.07 ms; OrderStatus: avg 14.65 ms, P99 100.48 ms; Delivery: avg 148.36 ms, P99 838.42 ms; StockLevel: avg 99.99 ms, P99 315.38 ms

Slide 41

Target Use Cases

Systems of Record and Engagement - resilient, business-critical data
● Identity management
● User/account profile
● eCommerce apps - checkout, shopping cart
● Real-time payment systems

Event Data and IoT - handling massive scale
● Vehicle telemetry
● Stock bids and asks
● Shipment information
● Credit card transactions

Geo-Distributed Workloads - sync, async, and geo replication

Slide 42

A Classic Enterprise App Scenario

Slide 43

Modern Cloud Native Application Stack

Deployed on
https://github.com/yugabyte/yugastore-java

Slide 44

Yugastore - Kronos Marketplace

Slide 45

Microservices Architecture

[Diagram: UI app and REST APIs behind an API Gateway; Cart, Product, and Checkout microservices backed by a Yugabyte cluster via YSQL and YCQL]

Slide 46

Istio Traffic Management for Microservices

[Diagram: UI app entering through the Istio edge gateway/proxy to the API Gateway and the Cart, Product, and Checkout microservices; the Istio control plane (Pilot, Galley, Citadel) provides service discovery and route configuration using Envoy proxies]

Slide 47

The fastest-growing Distributed SQL Database

● Slack users ▲ 3K
● Clusters deployed ▲ 600K
● We 💛 stars! Give us one: github.com/YugaByte/yugabyte-db
● Join our community: yugabyte.com/slack

Slide 48

Thank You

Join us on Slack: yugabyte.com/slack
Star us on GitHub: github.com/yugabyte/yugabyte-db