Slide 1

Slide 1 text

Lucas Borges Software Engineer on PostgreSQL & Citus team @Microsoft Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, & Azure Serverless

Slide 2

Slide 2 text

• Lucas Borges Fernandes • Customer Engineer in the Azure Cosmos DB for PostgreSQL team • Helping customers adopt Citus & PostgreSQL • MSc in Distributed Systems (Plug-and-play SMR for apps in K8s)

Slide 3

Slide 3 text

Agenda 1. Why auto scaling? 2. Auto scaling a Citus cluster 3. Grafana 4. Azure Functions (Serverless) 5. Azure Resource Manager (ARM) 6. Demo scenarios 7. Demo presentation 8. Key takeaways 9. Useful docs

Slide 4

Slide 4 text

Azure Cosmos DB for PostgreSQL = Azure + PostgreSQL + Citus open source

Slide 5

Slide 5 text

Why auto scaling? • Applications can experience spikes in usage • Human intervention may not be the most time effective option • Can prevent errors proactively 80% CPU 75% CPU Coordinator App Add node & rebalance shards High load

Slide 6

Slide 6 text

Auto scaling a Citus cluster Azure Cosmos DB for PostgreSQL cluster Azure Monitor & PostgreSQL connectors Alerts triggering Azure Functions (CPU, New workers, Tenants) Node/Typescript ARM SDK ARM API to update cluster Shard rebalancer & tenant isolation Auto scaler metadata

Slide 7

Slide 7 text

Grafana • Data sources • Azure Monitor • PostgreSQL • Time-series metric on workers count • Custom table and SQL (pg_cron & pg_dist_node) • Alerts on • CPU percentage --> Add workers' function • Worker count --> Rebalance shards function • Tenant disk usage --> Isolate tenant function

Slide 8

Slide 8 text

Azure Monitor (lucasbo) cituscon-cluster-04

Slide 9

Slide 9 text

Workers' count table • Query Citus metadata to get number of workers over time • pg_cron extension to insert new records every 5 seconds • Used primarily to feed timeseries panel on Grafana + alert • When new worker is added, count increases and rebalancer is triggered from Azure function SELECT cron.schedule( 'worker-count', '5 seconds', $$ INSERT INTO worker_count(created_at, count) SELECT now(), (SELECT count(*) as worker_count FROM pg_dist_node WHERE nodename ilike 'private-w%' LIMIT 1); $$);

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Azure Functions • Written in Node/Typescript • Azure Functions Core Tools to create boilerplate, test locally & deploy functions • func init autoscale-cdbpg --typescript • func new --name cpu_spike --template "HTTP trigger" • func azure functionapp publish autoscale-cdbpg • ARM JS/TS APIs to add new workers, Redis to manage state & PostgreSQL library to connect & execute commands • Triggered on demand by Grafana alerts

Slide 13

Slide 13 text

Azure Function Example

Slide 14

Slide 14 text

Azure Resource Manager (ARM) • Deployment and management service for Azure • Used to create and update Azure services, such as a Cosmos DB for PostgreSQL cluster • Can be used through Azure portal or APIs (REST, SDKs) • Typescript SDK API was used in this project

Slide 15

Slide 15 text

ARM template example for Azure Cosmos DB for PostgreSQL ... "coordinatorVCores": 32, "enableHa": true, "enableShardsOnCoordinator": true, "nodeCount": 2, "nodeEnablePublicIpAccess": true, "nodeServerEdition": "GeneralPurpose", "nodeStorageQuotaInMb": 524288, "nodeVCores": 32, "postgresqlVersion": "15" } } { "type": "Microsoft.DBforPostgreSQL/serverGroupsv2", "apiVersion": "2022-11-08", "name": "", "location": "eastus", "properties": { "administratorLoginPassword": "", "citusVersion": "11.2", "coordinatorEnablePublicIpAccess": true, "coordinatorServerEdition": "GeneralPurpose", "coordinatorStorageQuotaInMb": 524288, ...

Slide 16

Slide 16 text

ARM SDK API for Azure Cosmos DB for PostgreSQL • Easy to use APIs • Authentication using AAD service principals import { DefaultAzureCredential } from "@azure/identity" import { Deployment, ResourceManagementClient} from "@azure/arm-resources"; const credential = new DefaultAzureCredential(); const azureSubscriptionId = process.env.AZURE_SUBSCRIPTION_ID; const client = new ResourceManagementClient(credential, azureSubscriptionId); await client.deployments.beginCreateOrUpdate(resource_group, deploymentName, deployment);

Slide 17

Slide 17 text

Demo scenario #1 • pgbench to increase load in a 2 workers' cluster • 32 vCores / 128 GiB RAM coordinator • 16 vcores / 128 GiB RAM workers • PostgreSQL v15 / Citus v11.2 • 2 queries with joins and aggregations • CPU > 80% will trigger alert to increase workers count by 2 • New nodes metric will trigger alert to rebalance shards

Slide 18

Slide 18 text

Demo scenario #2 • pgbench to increase single tenant disk usage • 32 vCores / 128 GiB RAM coordinator • 16 vcores / 128 GiB RAM workers • PostgreSQL v15 / Citus v11.2 • 1 query inserting on table with hardcoded distribution key • Single tenant representing > 50% of shard size will trigger alert to isolate tenant & rebalance shards based on disk usage

Slide 19

Slide 19 text

Demo

Slide 20

Slide 20 text

Key takeaways 1. Grafana + Azure Functions can be used to implement any event-based action 2. Reduced latency from 150ms to 100ms 3. Azure Cosmos DB for PostgreSQL (Citus on Azure) can be used with regular PostgreSQL data source connector 4. Grafana + Citus metadata can give powerful insights about the distributed Postgres cluster

Slide 21

Slide 21 text

Useful Docs links to learn more... • GitHub repo for open source Citus database extension (github.com) • Docs: Overview of Azure Cosmos DB for PostgreSQL (Citus on Azure) • Docs: What is Azure Managed Grafana? • Docs: Azure Functions Overview • Docs: Azure Resource Manager (ARM) overview • Docs & Tutorial: Azure SDK for Javascript/Typescript