Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, and Azure Serverless | Citus Con 2023 | Lucas Borges Fernandes

Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, and Azure Serverless | Citus Con 2023 | Lucas Borges Fernandes

Azure Cosmos DB for PostgreSQL is a managed service offering that is powered by the open-source Citus database extension to Postgres. It has many features to help run enterprise-ready applications. One of the top Citus features is the ability to run PostgreSQL at any scale, on a single node as well as a distributed database cluster.

As your application needs to scale, you can add more nodes to the Azure Cosmos DB for PostgreSQL cluster, rebalance existing data to the new Postgres nodes, and seamlessly scale out. However, these operations require manual intervention: a) first you must create alerts on metrics, b) then, based on those alerts, you need to add more nodes, c) then you must kick off and monitor the shard rebalancer. Automating these steps will give you a complete auto scale experience—and make your life so much easier.

Let us learn how to “auto scale” by using existing tools such as Grafana & Azure Serverless.

Configuring Grafana to capture metrics from Azure Cosmos DB for PostgreSQL
Configure alerts based on those metrics
Trigger Azure Serverless function using Grafana alerts to automate the process of scaling out, both automatic node addition and shard rebalancer maintenance
As icing on the cake, we will also demonstrate automatic tenant isolation!

More Decks by Azure Database for PostgreSQL

Other Decks in Technology

Transcript

  1. Lucas Borges Software Engineer on PostgreSQL & Citus team @Microsoft

    Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, & Azure Serverless
  2. • Lucas Borges Fernandes • Customer Engineer in the Azure

    Cosmos DB for PostgreSQL team • Helping customers adopt Citus & PostgreSQL • MSc in Distributed Systems (Plug-and-play SMR for apps in K8s)
  3. Agenda 1. Why auto scaling? 2. Auto scaling a Citus

    cluster 3. Grafana 4. Azure Functions (Serverless) 5. Azure Resource Manager (ARM) 6. Demo scenarios 7. Demo presentation 8. Key takeaways 9. Useful docs
  4. Why auto scaling? • Applications can experience spikes in usage

    • Human intervention may not be the most time effective option • Can prevent errors proactively 80% CPU 75% CPU Coordinator App Add node & rebalance shards High load
  5. Auto scaling a Citus cluster Azure Cosmos DB for PostgreSQL

    cluster Azure Monitor & PostgreSQL connectors Alerts triggering Azure Functions (CPU, New workers, Tenants) Node/Typescript ARM SDK ARM API to update cluster Shard rebalancer & tenant isolation Auto scaler metadata
  6. Grafana • Data sources • Azure Monitor • PostgreSQL •

    Time-series metric on workers count • Custom table and SQL (pg_cron & pg_dist_node) • Alerts on • CPU percentage --> Add workers' function • Worker count --> Rebalance shards function • Tenant disk usage --> Isolate tenant function
  7. Workers' count table • Query Citus metadata to get number

    of workers over time • pg_cron extension to insert new records every 5 seconds • Used primarily to feed timeseries panel on Grafana + alert • When new worker is added, count increases and rebalancer is triggered from Azure function SELECT cron.schedule( 'worker-count', '5 seconds', $$ INSERT INTO worker_count(created_at, count) SELECT now(), (SELECT count(*) as worker_count FROM pg_dist_node WHERE nodename ilike 'private-w%' LIMIT 1); $$);
  8. Azure Functions • Written in Node/Typescript • Azure Functions Core

    Tools to create boilerplate, test locally & deploy functions • func init autoscale-cdbpg --typescript • func new --name cpu_spike --template "HTTP trigger" • func azure functionapp publish autoscale-cdbpg • ARM JS/TS APIs to add new workers, Redis to manage state & PostgreSQL library to connect & execute commands • Triggered on demand by Grafana alerts
  9. Azure Resource Manager (ARM) • Deployment and management service for

    Azure • Used to create and update Azure services, such as a Cosmos DB for PostgreSQL cluster • Can be used through Azure portal or APIs (REST, SDKs) • Typescript SDK API was used in this project
  10. ARM template example for Azure Cosmos DB for PostgreSQL ...

    "coordinatorVCores": 32, "enableHa": true, "enableShardsOnCoordinator": true, "nodeCount": 2, "nodeEnablePublicIpAccess": true, "nodeServerEdition": "GeneralPurpose", "nodeStorageQuotaInMb": 524288, "nodeVCores": 32, "postgresqlVersion": "15" } } { "type": "Microsoft.DBforPostgreSQL/serverGroupsv2", "apiVersion": "2022-11-08", "name": "<cluster-name>", "location": "eastus", "properties": { "administratorLoginPassword": "<password>", "citusVersion": "11.2", "coordinatorEnablePublicIpAccess": true, "coordinatorServerEdition": "GeneralPurpose", "coordinatorStorageQuotaInMb": 524288, ...
  11. ARM SDK API for Azure Cosmos DB for PostgreSQL •

    Easy to use APIs • Authentication using AAD service principals import { DefaultAzureCredential } from "@azure/identity" import { Deployment, ResourceManagementClient} from "@azure/arm-resources"; const credential = new DefaultAzureCredential(); const azureSubscriptionId = process.env.AZURE_SUBSCRIPTION_ID; const client = new ResourceManagementClient(credential, azureSubscriptionId); await client.deployments.beginCreateOrUpdate(resource_group, deploymentName, deployment);
  12. Demo scenario #1 • pgbench to increase load in a

    2 workers' cluster • 32 vCores / 128 GiB RAM coordinator • 16 vcores / 128 GiB RAM workers • PostgreSQL v15 / Citus v11.2 • 2 queries with joins and aggregations • CPU > 80% will trigger alert to increase workers count by 2 • New nodes metric will trigger alert to rebalance shards
  13. Demo scenario #2 • pgbench to increase single tenant disk

    usage • 32 vCores / 128 GiB RAM coordinator • 16 vcores / 128 GiB RAM workers • PostgreSQL v15 / Citus v11.2 • 1 query inserting on table with hardcoded distribution key • Single tenant representing > 50% of shard size will trigger alert to isolate tenant & rebalance shards based on disk usage
  14. Key takeaways 1. Grafana + Azure Functions can be used

    to implement any event-based action 2. Reduced latency from 150ms to 100ms 3. Azure Cosmos DB for PostgreSQL (Citus on Azure) can be used with regular PostgreSQL data source connector 4. Grafana + Citus metadata can give powerful insights about the distributed Postgres cluster
  15. Useful Docs links to learn more... • GitHub repo for

    open source Citus database extension (github.com) • Docs: Overview of Azure Cosmos DB for PostgreSQL (Citus on Azure) • Docs: What is Azure Managed Grafana? • Docs: Azure Functions Overview • Docs: Azure Resource Manager (ARM) overview • Docs & Tutorial: Azure SDK for Javascript/Typescript