Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, and Azure Serverless | Citus Con 2023 | Lucas Borges Fernandes

Auto scaling Azure Cosmos DB for PostgreSQL with Citus, Grafana, and Azure Serverless | Citus Con 2023 | Lucas Borges Fernandes

Azure Cosmos DB for PostgreSQL is a managed service offering that is powered by the open-source Citus database extension to Postgres. It has many features to help run enterprise-ready applications. One of the top Citus features is the ability to run PostgreSQL at any scale, on a single node as well as a distributed database cluster.

As your application needs to scale, you can add more nodes to the Azure Cosmos DB for PostgreSQL cluster, rebalance existing data to the new Postgres nodes, and seamlessly scale out. However, these operations require manual intervention: a) first you must create alerts on metrics, b) then, based on those alerts, you need to add more nodes, c) then you must kick off and monitor the shard rebalancer. Automating these steps will give you a complete auto scale experience—and make your life so much easier.

Let us learn how to “auto scale” by using existing tools such as Grafana & Azure Serverless.

Configuring Grafana to capture metrics from Azure Cosmos DB for PostgreSQL
Configure alerts based on those metrics
Trigger Azure Serverless function using Grafana alerts to automate the process of scaling out, both automatic node addition and shard rebalancer maintenance
As icing on the cake, we will also demonstrate automatic tenant isolation!

More Decks by Azure Database for PostgreSQL

Other Decks in Technology

Transcript

  1. Lucas Borges
    Software Engineer on
    PostgreSQL & Citus team @Microsoft
    Auto scaling Azure
    Cosmos DB for PostgreSQL
    with Citus, Grafana, &
    Azure Serverless

    View Slide

  2. • Lucas Borges Fernandes
    • Customer Engineer in the Azure
    Cosmos DB for PostgreSQL team
    • Helping customers adopt
    Citus & PostgreSQL
    • MSc in Distributed Systems
    (Plug-and-play SMR for apps in K8s)

    View Slide

  3. Agenda
    1. Why auto scaling?
    2. Auto scaling a Citus cluster
    3. Grafana
    4. Azure Functions (Serverless)
    5. Azure Resource Manager (ARM)
    6. Demo scenarios
    7. Demo presentation
    8. Key takeaways
    9. Useful docs

    View Slide

  4. Azure Cosmos DB
    for PostgreSQL =
    Azure +
    PostgreSQL +
    Citus open source

    View Slide

  5. Why auto scaling?
    • Applications can
    experience spikes in usage
    • Human intervention
    may not be the most time
    effective option
    • Can prevent errors
    proactively
    80% CPU
    75% CPU
    Coordinator
    App
    Add node &
    rebalance
    shards
    High load

    View Slide

  6. Auto scaling a Citus cluster
    Azure Cosmos DB
    for PostgreSQL
    cluster
    Azure Monitor &
    PostgreSQL connectors
    Alerts triggering Azure Functions
    (CPU, New workers, Tenants)
    Node/Typescript
    ARM SDK
    ARM API to
    update cluster
    Shard rebalancer & tenant isolation
    Auto scaler
    metadata

    View Slide

  7. Grafana
    • Data sources
    • Azure Monitor
    • PostgreSQL
    • Time-series metric on workers count
    • Custom table and SQL (pg_cron & pg_dist_node)
    • Alerts on
    • CPU percentage --> Add workers' function
    • Worker count --> Rebalance shards function
    • Tenant disk usage --> Isolate tenant function

    View Slide

  8. Azure Monitor (lucasbo)
    cituscon-cluster-04

    View Slide

  9. Workers' count table
    • Query Citus metadata to get
    number of workers over time
    • pg_cron extension to insert new
    records every 5 seconds
    • Used primarily to feed timeseries
    panel on Grafana + alert
    • When new worker is added,
    count increases and rebalancer is
    triggered from Azure function
    SELECT cron.schedule( 'worker-count', '5 seconds', $$
    INSERT INTO worker_count(created_at, count)
    SELECT
    now(),
    (SELECT
    count(*) as worker_count
    FROM
    pg_dist_node
    WHERE
    nodename ilike 'private-w%' LIMIT 1);
    $$);

    View Slide

  10. View Slide

  11. View Slide

  12. Azure
    Functions
    • Written in Node/Typescript
    • Azure Functions Core Tools to create boilerplate, test
    locally & deploy functions
    • func init autoscale-cdbpg --typescript
    • func new --name cpu_spike --template "HTTP trigger"
    • func azure functionapp publish autoscale-cdbpg
    • ARM JS/TS APIs to add new workers, Redis to manage
    state & PostgreSQL library to connect & execute
    commands
    • Triggered on demand by Grafana alerts

    View Slide

  13. Azure Function Example

    View Slide

  14. Azure
    Resource
    Manager
    (ARM)
    • Deployment and management service for Azure
    • Used to create and update Azure services, such as
    a Cosmos DB for PostgreSQL cluster
    • Can be used through Azure portal or APIs (REST, SDKs)
    • Typescript SDK API was used in this project

    View Slide

  15. ARM template example for
    Azure Cosmos DB for PostgreSQL
    ...
    "coordinatorVCores": 32,
    "enableHa": true,
    "enableShardsOnCoordinator": true,
    "nodeCount": 2,
    "nodeEnablePublicIpAccess": true,
    "nodeServerEdition": "GeneralPurpose",
    "nodeStorageQuotaInMb": 524288,
    "nodeVCores": 32,
    "postgresqlVersion": "15"
    } }
    { "type":
    "Microsoft.DBforPostgreSQL/serverGroupsv2",
    "apiVersion": "2022-11-08",
    "name": "",
    "location": "eastus",
    "properties": {
    "administratorLoginPassword": "",
    "citusVersion": "11.2",
    "coordinatorEnablePublicIpAccess": true,
    "coordinatorServerEdition": "GeneralPurpose",
    "coordinatorStorageQuotaInMb": 524288,
    ...

    View Slide

  16. ARM SDK API for
    Azure Cosmos DB for PostgreSQL
    • Easy to use
    APIs
    • Authentication
    using
    AAD service
    principals
    import { DefaultAzureCredential } from "@azure/identity"
    import { Deployment, ResourceManagementClient} from "@azure/arm-resources";
    const credential = new DefaultAzureCredential();
    const azureSubscriptionId = process.env.AZURE_SUBSCRIPTION_ID;
    const client = new ResourceManagementClient(credential, azureSubscriptionId);
    await client.deployments.beginCreateOrUpdate(resource_group, deploymentName,
    deployment);

    View Slide

  17. Demo scenario #1
    • pgbench to increase load in a 2 workers' cluster
    • 32 vCores / 128 GiB RAM coordinator
    • 16 vcores / 128 GiB RAM workers
    • PostgreSQL v15 / Citus v11.2
    • 2 queries with joins and aggregations
    • CPU > 80% will trigger alert to increase workers count by 2
    • New nodes metric will trigger alert to rebalance shards

    View Slide

  18. Demo scenario #2
    • pgbench to increase single tenant disk usage
    • 32 vCores / 128 GiB RAM coordinator
    • 16 vcores / 128 GiB RAM workers
    • PostgreSQL v15 / Citus v11.2
    • 1 query inserting on table with hardcoded distribution key
    • Single tenant representing > 50% of shard size will trigger alert to isolate tenant &
    rebalance shards based on disk usage

    View Slide

  19. Demo

    View Slide

  20. Key takeaways
    1. Grafana + Azure Functions can be used to implement any event-based
    action
    2. Reduced latency from 150ms to 100ms
    3. Azure Cosmos DB for PostgreSQL (Citus on Azure) can be used with
    regular PostgreSQL data source connector
    4. Grafana + Citus metadata can give powerful insights about the distributed
    Postgres cluster

    View Slide

  21. Useful Docs links to learn more...
    • GitHub repo for open source Citus database extension (github.com)
    • Docs: Overview of Azure Cosmos DB for PostgreSQL (Citus on Azure)
    • Docs: What is Azure Managed Grafana?
    • Docs: Azure Functions Overview
    • Docs: Azure Resource Manager (ARM) overview
    • Docs & Tutorial: Azure SDK for Javascript/Typescript

    View Slide