Slide 1

Slide 1 text

YugabyteDB Operational Best Practices
Amey Banarse
Principal Data Architect, Yugabyte

Slide 2

Slide 2 text

Introduction
Amey Banarse
Principal Data Architect, YugabyteDB ♦ Pivotal ♦ FINRA
University of Pennsylvania
@ameybanarse
http://about.me/amey
© 2019 All rights reserved.

Slide 3

Slide 3 text

PostgreSQL-compatible, high-performance, open-source, cloud-native distributed SQL database
100% Apache 2.0
Low Latency & High Throughput
Built for Kubernetes & Cloud Native Ecosystem
© 2020 All Rights Reserved

Slide 4

Slide 4 text

Introducing DBRE model
● Database Reliability Engineering
  ○ Inspired by Google’s SRE model
  ○ Blending DevOps culture with DBA teams
  ○ Infrastructure as code
  ○ Automation is the key
Yugabyte Confidential © 2019 All rights reserved.

Slide 5

Slide 5 text

DBRE Guiding Principles
● Responsibility for the data is shared by cross-functional teams
● Provide patterns and knowledge that support other teams’ processes and facilitate their work
● Define reference architectures and configurations for data stores that are approved for operations and can be deployed by teams

Slide 6

Slide 6 text

Cloud Native - cncf.io definition
“Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”

Slide 7

Slide 7 text

Designed for Cloud Native Microservices

Yugabyte Query Layer
  YSQL
  YCQL
DocDB Distributed Document Store
  Sharding & Load Balancing
  Raft Consensus Replication
  Distributed Transaction Manager & MVCC
  Document Storage Layer
  Custom RocksDB Storage Engine

                   PostgreSQL              Google Spanner           YugabyteDB
SQL Ecosystem      ✓ Massively adopted     ✘ New SQL flavor         ✓ Reuse PostgreSQL
RDBMS Features     ✓ Advanced Complex      ✘ Basic cloud-native     ✓ Advanced Complex and cloud-native
Highly Available   ✘                       ✓                        ✓
Horizontal Scale   ✘                       ✓                        ✓
Distributed Txns   ✘                       ✓                        ✓
Data Replication   Async                   Sync                     Sync + Async

Slide 8

Slide 8 text

Yugabyte offers flexible consumption models
Yugabyte Cloud
Fully Managed DBaaS
24 x 7 Enterprise Support
Yugabyte Platform UI
Operational Excellence
DBaaS out of the box
Yugabyte Platform
YugabyteDB
Self managed
Self Service UI
Yugabyte managed
Public DBaaS
Community supported
Yugabyte DB
Transactional Distributed SQL DB
100% Open Source Apache 2.0
PostgreSQL compatible
Enterprise grade RDBMS
Cloud Native
Self or Yugabyte managed
https://download.yugabyte.com

Slide 9

Slide 9 text

Yugabyte Platform
● Comprehensive operations lifecycle manager
● Cloud Native deployments on public & private cloud
● Robust automation for Day 2 ops with self healing
● Built-in monitoring and alerts
● Configuration & Change Management

Slide 10

Slide 10 text

Self-Service Deployment
● Choosing the right topology for the business
  ○ Business Continuity or data regulations
  ○ Single or Multi Regional cluster
● Multi & Hybrid Cloud deployments
● Self Service deployment on any form factor - Containers, Virtual Machines (VMs), Bare Metal
● Ability to expand to new Regions

Slide 11

Slide 11 text

Resilient and strongly consistent across failure domains
1. Single Region, Multi-Zone
   Availability Zone 1, Availability Zone 2, Availability Zone 3
   Consistent Across Zones
   No WAN Latency But No Region-Level Failover/Repair
2. Single Cloud, Multi-Region
   Region 1, Region 2, Region 3
   Consistent Across Regions with Auto Region-Level Failover/Repair
3. Multi-Cloud, Multi-Region
   Cloud 1, Cloud 2, Cloud 3
   Consistent Across Clouds with Auto Cloud-Level Failover/Repair

Slide 12

Slide 12 text

YugabyteDB Customers & Sample Deployment Topologies
Customer A
● On AWS: 7 production clusters, 3 of them with 15 nodes each, plus ~4 non-production clusters
● Deployed across 3 AZs, 5 nodes per AZ, rf=3. Can handle AZ failures.
● In production, they performed a zero-downtime upgrade of YugabyteDB from v1.1.9 to v1.2.6
● ~400 GB of compressed data per node (roughly 1 TB uncompressed)
● Write-heavy clusters, 60K to 80K writes/sec at peak with CPU utilization < 20%
Customer B
● 18-node production cluster with 2 TB of compressed data per node
● 300+ days of uninterrupted uptime and availability
● Zero-downtime node repair
● Cluster expansion with zero downtime, completed in minutes
© 2018 All rights reserved.
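The claim that an rf=3 cluster spread over 3 AZs can handle an AZ failure follows from majority-quorum arithmetic. Below is a minimal sketch of that arithmetic, assuming Raft-style majority quorums and assuming each tablet's replicas are spread as evenly as possible across AZs. The placement policy and the function names are illustrative assumptions, not taken from the slides or from any YugabyteDB API.

```python
def majority(rf: int) -> int:
    """Smallest number of replicas that forms a majority of rf replicas."""
    return rf // 2 + 1


def tolerated_replica_failures(rf: int) -> int:
    """Replicas that can be lost while a majority still survives."""
    return rf - majority(rf)


def survives_one_az_loss(rf: int, num_azs: int) -> bool:
    """True if losing any single AZ still leaves a majority of replicas.

    Assumption: replicas of a tablet are spread as evenly as possible
    across AZs, so the worst-hit AZ holds ceil(rf / num_azs) of them.
    """
    worst_case_lost = -(-rf // num_azs)  # ceiling division
    return rf - worst_case_lost >= majority(rf)


if __name__ == "__main__":
    # The Customer A layout: rf=3 across 3 AZs.
    print(majority(3), tolerated_replica_failures(3), survives_one_az_loss(3, 3))
```

Under these assumptions, rf=3 over 3 AZs keeps 2 of 3 replicas after an AZ outage, which is still a majority. The same rf=3 over only 2 AZs would not, because one AZ would hold 2 of the 3 replicas.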

Slide 13

Slide 13 text

Automation
● Changes incorporated into deployment and infrastructure automation, with focus on testing, fallback, and impact mitigation
● Rolling Upgrades, Scale Up/Down & CI tools
● Self healing
● Operational Efficiency with minimal toil
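The rolling-upgrade idea with testing and fallback can be sketched as a loop that touches one node at a time and stops at the first unhealthy node. This is an illustration only: `upgrade`, `is_healthy`, and `rollback` are hypothetical callables supplied by the caller, not functions from Yugabyte Platform or any other tool.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Upgrade nodes one at a time, gating each step on a health check.

    If a node is unhealthy after its upgrade, that node is rolled back and
    the procedure stops, so at most one node is affected at any moment.
    """
    upgraded: List[str] = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            rollback(node)  # fallback limits the impact to this node
            raise RuntimeError(f"upgrade halted: {node} failed its health check")
        upgraded.append(node)
    return upgraded


if __name__ == "__main__":
    # Dry run with stand-in callables; node names are made up.
    done = rolling_upgrade(
        ["node-1", "node-2", "node-3"],
        upgrade=lambda n: print(f"upgrading {n}"),
        is_healthy=lambda n: True,
        rollback=lambda n: print(f"rolling back {n}"),
    )
    print(done)
```

Keeping the three operations as injected callables means the same loop can be exercised in CI with fakes before it is pointed at real infrastructure.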

Slide 14

Slide 14 text

Data Recoverability
● Standardized and automated backup and recovery processes
● Business Continuity and Disaster Recovery
  ○ Defining RPO and RTO
● Defining the right backup strategy with data validation pipelines
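Defining RPO and RTO becomes concrete once they are checked against a backup schedule. The sketch below is a rough illustration, assuming periodic full backups (so the worst-case data loss equals the backup interval) and a constant restore throughput. All figures used with it are hypothetical, not measurements from the talk.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, worst-case data loss is one full interval."""
    return backup_interval <= rpo


def estimated_restore_time(data_gb: float, restore_gb_per_hour: float) -> timedelta:
    """Naive restore estimate: data size divided by a constant throughput."""
    return timedelta(hours=data_gb / restore_gb_per_hour)


def meets_rto(data_gb: float, restore_gb_per_hour: float, rto: timedelta) -> bool:
    """True if the naive restore estimate fits inside the RTO."""
    return estimated_restore_time(data_gb, restore_gb_per_hour) <= rto


if __name__ == "__main__":
    # Hypothetical targets: hourly backups, 4 h RPO, 400 GB at 200 GB/h, 3 h RTO.
    print(meets_rpo(timedelta(hours=1), timedelta(hours=4)))
    print(estimated_restore_time(400, 200))
    print(meets_rto(400, 200, timedelta(hours=3)))
```

A real estimate would also include detection time, provisioning, and data validation, so this gives a lower bound on recovery time, not a prediction.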

Slide 15

Slide 15 text

Data Recoverability
● Auto Failover
● Resilient to AZ/Rack outages
● No downtime for client apps

Slide 16

Slide 16 text

Config Management
● IaaS Configurations
● Centralized Store for Cluster Config
● Security - KMS or SmartKey integration
● Backup/Restore - Object Store or NFS

Slide 17

Slide 17 text

Monitoring & Health Checks
● Monitoring KPIs at various levels
  ○ Node, Network, Memory usage & Storage layer
● Build alerting and integrate with Incident Response platforms like PagerDuty
● Automated health-checks for regular activities
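An automated health check of the kind listed above can be as simple as comparing sampled KPIs against thresholds and emitting alert messages. The sketch below assumes the metrics have already been collected into a dictionary. The metric names and limits are hypothetical, and forwarding alerts to an incident-response platform is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str       # hypothetical metric name, e.g. "cpu_percent"
    max_value: float  # alert when the sampled value exceeds this limit


def evaluate(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per breached or missing metric."""
    alerts: List[str] = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            # A missing metric is itself a signal that collection is broken.
            alerts.append(f"{t.metric}: no data")
        elif value > t.max_value:
            alerts.append(f"{t.metric}: {value} exceeds {t.max_value}")
    return alerts


if __name__ == "__main__":
    limits = [
        Threshold("cpu_percent", 80.0),
        Threshold("disk_used_percent", 85.0),
        Threshold("memory_used_percent", 90.0),
    ]
    for alert in evaluate({"cpu_percent": 95.0, "disk_used_percent": 40.0}, limits):
        print(alert)
```

Treating "no data" as an alert avoids the common failure mode where a broken collector looks like a healthy node.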

Slide 18

Slide 18 text

Enterprise Security Hardening
• Third-party security audit
• Audit Logging for DB operations
• YSQL Kerberos and GSSAPI (in development)
• LDAP support for better user management
  • YSQL support has been GA since YB v2.4
  • YCQL support will be available in YB v2.6.x
• Data in Transit Encryption Enhancements
  • TLS v1.2 across the database
  • Simplified TLS workflows in YB Platform
• Data at Rest Encryption with Vormetric TDE and AWS KMS integration

Slide 19

Slide 19 text

Case Study

Slide 20

Slide 20 text

Plume Design
● 27 billion operations per day with 30-40 ms of latency
● 35 TB of data managed by a single cluster
● Rolling upgrades with zero downtime
● Less time spent managing databases and more time spent on core business
● YugabyteDB can support Plume’s next phase of growth, which is predicted to reach 75+ billion ops per day
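The daily operation counts are easier to reason about as per-second averages. The short calculation below converts the two figures from the case study. It assumes load is spread evenly over the day, which real traffic rarely is, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_ops_per_second(ops_per_day: float) -> float:
    """Average rate, assuming operations are spread evenly over a day."""
    return ops_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    current = average_ops_per_second(27_000_000_000)
    projected = average_ops_per_second(75_000_000_000)
    print(current)           # 312500.0
    print(round(projected))  # 868056
    print(round(projected / current, 2))  # 2.78
```

So the current workload averages about 312,500 ops/sec, and the projected 75 billion ops per day is roughly a 2.8x increase.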

Slide 21

Slide 21 text

Demo