Citus from the Customer eyes | Citus Con: An Event for Postgres 2023 | Sai Srirampur

Citus from the Customer Eyes Sai Srirampur Engineering Manager, Customer
Success for Postgres on Azure

This talk is for 1. SaaS companies looking to scale
with distributed PostgreSQL. 100s 10s 1000s

This talk is for 2. Companies who want to build
scalable Timeseries apps

Customer Journey - 4 steps 1. What is Citus? 2.
When to use Citus? 3. Data-modeling with Citus 4. Running app in production with Citus

What is Citus?

Postgres At Any Scale

Superpower of ”distributed tables” to scale-out Postgres Start Building apps
on Single node Seamlessly scale to multiple nodes by distributing tables

Packaged as extension, not a fork Postgres At Any Scale
Open source & Fully managed

Packaged as extension, not a fork Postgres At Any Scale
Open source & Fully managed Open source & fully-managed

Global replication Security Distributed Scale High Availability Compliance Enterprise-grade monitoring
Easy ecosystem integration Extension support Extensions JSONB Full text search Geospatial support Rich indexing Powered by fully Open Source Citus Extension to Postgres Azure Cosmos DB for PostgreSQL Fully managed D-SQL offering built upon core benefits of PostgreSQL

When to use Citus? Multi-tenant SaaS Apps Timeseries/IoT Apps

Multi-tenant SaaS Apps – Scenarios • Expecting to outgrow single
node soon. • Complex homegrown sharding to manage tenants • Incurring large tech debt • Inflexible to scale out • Inflexible to rebalance tenants

Scaling multi-tenant SaaS apps with Citus • Automatic sharding of
tenants across nodes • Online scale-out with shard rebalancer • Tenant Isolation • Comprehensive tenant monitoring • Row Level Security per tenant

11K+ Customers 800ms response times for 95% of queries 100
requests per second 20 nodes 10B+ Total #rows 2 clusters in EU and US Reference

Timeseries / IoT apps - Scenarios • Apps providing real-time
insights on telemetry data • Examples: device telemetry, vehicle telemetry, logs, event data etc. • Have implicit scale requirement • Affinity to PostgreSQL

Scaling timeseries apps with Citus • Horizontal scale out -
Start small and scale out to multiple nodes • JSONB to store semi-structured data • Native timeseries APIs for partitioning, expiry and archival.

Reference

Data-modeling with Citus

5 tips for distributed data-modeling with Postgres

TIP 1: Column that adds natural dimension to data is
a good shard key • Represents a central piece of application • Lives at heart of application • Examples: • SaaS apps – customer_id is a good candidate • IoT app – the device_id is a good candidate More details on Distribution column and how Citus distributes data and queries

github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload
Created_at org github_users User_id url login avatar Country_id gravatar Display login User_id is good candidate for distributing tables • Events generated by users themselves • Adds natural dimension & is at center of workload countries Country_id Name Continent Capital Example: App to analyze GitHub events

TIP 2: Co-locate large tables & make small tables reference

Example: App to analyze GitHub events github_events Event_id Event_type Event_hash
Repo id User_id Repo id Payload Created_at org github_users User_id url login avatar Country_id gravatar Display login countries Country_id Name Continent Capital Distributed PostgreSQL tables with co-location: Events & users distributed on user_id SELECT create_distributed_table( 'github_users', 'user_id'); SELECT create_distributed_table( 'github_events', 'user_id'); Reference table: SELECT create_reference_table( 'countries');

Power of Co-locating tables Items-4 Items-4 Items-4 events_4 countries Items-4
Items-4 Items-4 users_4 Items-4 Items-4 Items-4 events_8 Items-4 Items-4 Items-4 users_8 countries SQL pushdown - joins/fkeys Citus Worker 1 Citus Worker 2 Citus Coordinator countries events users

github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload
(JSONB) Created_at org • Simplified schema – single table to store a variety of data • GIN indexing for fast JSONB access • NoSQL capabilities to Postgres world • Helpful in implementing workloads that need scale – Ex IoT, Events, Logs TIP 3: Use JSONB data-type to store unstructured data

TIP 4: Timeseries partitioning on distributed tables github_events Event_id Event_type
User_id Repo id User_id Repo id Payload (JSONB) Created_at org Distribute on user_id Partition on created_at github_events_jan Event_id Event_type User_id … github_events_feb Event_id Event_type User_id … github_events_sept Event_id Event_type User_id … github_events_aug Event_id Event_type User_id … github_events_july Event_id Event_type User_id … . . . . . . . Archiving older partitions using columnar Most recent partitions using row storage – real time ingests • Helps filtered querying with timestamp filters • Efficient expiry – DROP TABLE instead of DELETE • Helps archiving older data through columnar • Common in implementing timeseries workloads

Timeseries APIs for partitioning, archival, & expiry • Automatic Data
Partitioning – Easily Create partitions in past and future SELECT create_time_partitions( table_name:= 'time_series_events', partition_interval:= '1 day’, end_at:= '2021-10-30’, start_from:= '2021-10-10'); • Automatic Data Archival – Convert older partitions from row to columnar CALL alter_old_partitions_set_access_method('time_series_events’, '2021-10-20', 'columnar'); • Automatic Data Expiry - Expire data older than a given time period CALL drop_old_time_partitions( table_name:= 'time_series_events', older_than:= '2021-10-15');

TIP 5: Scope queries to shard key where possible •
Ex: SaaS workloads – queries can be scoped on tenant_id UPDATE ads SET impressions_count = impressions_count+1 WHERE id = 42; TO UPDATE ads SET impressions_count = impressions_count+1 WHERE id = 42 AND company_id = 1; • Avoids unnecessary fan out & helps query latency. • More relevant in transactional workloads

5 tips for distributed data-modeling with Postgres • TIP 1:
Column that adds natural dimension to data is a good shard key • TIP 2: Co-locate large tables, make small tables reference • TIP 3: Use JSONB data-type to store unstructured data • TIP 4: Timeseries partitioning on distributed tables • TIP 5: Scope queries to shard key where possible

Running app in production with Citus

Line item Open Source / Self managed Azure Cosmos DB
for PostgreSQL App Migration Use 5-tips for data-modeling and modify app End-to-end migration guide Data Migration pg_dump/pg_restore, pgcopydb for online migration pg_azure_storage extension: Query and load data directly from blob. ADF, Azure Databricks, Azure Stream Analytics, any integration to PG. Deployment Docker images for dev/test Burstable Compute for dev/test ARM/BICEP templates High Availability Patroni HA for Citus Auto HA with SLAs Upgrades Open Source Docs Self-serve upgrades Disaster Recovery pg_dump/pg_restore pgbackrest, wal-e, wal-g Automatic Cross-AZ Backups + PITR Cross-region read replicas Infra Metrics Self-managed using any PG monitoring tool Azure Monitor Integration PG Bouncer Setup using community docs Pgbouncer support included DB tuning and troubleshooting Performance tuning 1, Performance tuning 2, Useful Diagnostic Queries Security Roles, RLS Roles, RLS, Azure Firewall, Azure Private Endpoint Scale Self-manage addition, removal of infra Use fully OSS shard rebalancer Out-of-box support for vertical and horizontal scale Checklist to run app in production with Citus

DEMO of Azure Cosmos DB for PostgreSQL (aka Citus on
Azure)

@saisrirampur @AzureCosmosDB @citusdata

Citus from the Customer eyes | Citus Con: An Ev...

Citus from the Customer eyes | Citus Con: An Event for Postgres 2023 | Sai Srirampur

Citus Data

More Decks by Citus Data

Other Decks in Technology

Featured

Transcript

Citus from the Customer Eyes Sai Srirampur Engineering Manager, Customer

This talk is for 1. SaaS companies looking to scale

This talk is for 2. Companies who want to build

Customer Journey - 4 steps 1. What is Citus? 2.

What is Citus?

Postgres At Any Scale

Superpower of ”distributed tables” to scale-out Postgres Start Building apps

Packaged as extension, not a fork Postgres At Any Scale

Packaged as extension, not a fork Postgres At Any Scale

Global replication Security Distributed Scale High Availability Compliance Enterprise-grade monitoring

When to use Citus? Multi-tenant SaaS Apps Timeseries/IoT Apps

Multi-tenant SaaS Apps – Scenarios • Expecting to outgrow single

Scaling multi-tenant SaaS apps with Citus • Automatic sharding of

11K+ Customers 800ms response times for 95% of queries 100

Timeseries / IoT apps - Scenarios • Apps providing real-time

Scaling timeseries apps with Citus • Horizontal scale out -

Reference

Data-modeling with Citus

5 tips for distributed data-modeling with Postgres

TIP 1: Column that adds natural dimension to data is

github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload

TIP 2: Co-locate large tables & make small tables reference

Example: App to analyze GitHub events github_events Event_id Event_type Event_hash

Power of Co-locating tables Items-4 Items-4 Items-4 events_4 countries Items-4

github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload

TIP 4: Timeseries partitioning on distributed tables github_events Event_id Event_type

Timeseries APIs for partitioning, archival, & expiry • Automatic Data

TIP 5: Scope queries to shard key where possible •

5 tips for distributed data-modeling with Postgres • TIP 1:

Running app in production with Citus

Line item Open Source / Self managed Azure Cosmos DB

DEMO of Azure Cosmos DB for PostgreSQL (aka Citus on

@saisrirampur @AzureCosmosDB @citusdata