Upgrade to PRO for Only $50/Year—Limited-Time Offer! 🔥

Citus from the Customer eyes | Citus Con: An Ev...

Citus from the Customer eyes | Citus Con: An Event for Postgres 2023 | Sai Srirampur

As a Solutions Engineer for the Citus database extension for the past ~7.5 years, I have closely worked with many customers and onboarded them to run their applications on Citus and PostgreSQL. This talk will synthesize all those exciting customer experiences to present an end-to-end journey of what a typical customer onboarding looks like for both Citus open source and for Citus on Azure (now known as Azure Cosmos DB for PostgreSQL.)

This talk will cover steps starting from "product discovery"—where you get to know the value of Citus, to "proof-of-concept"—where you validate that value, to "migration"—where you implement the workload and go into production with Citus.

The following topics would be covered as a part of the talk:

Why Citus to distribute PostgreSQL?
Fit use cases and the misfit ones
Data-modeling (includes distributed data-modeling)
Performance tuning
Migration process
Value of the managed service offering and so on
The talk would be a perfect fit for any customer who is planning to evaluate or is already evaluating Citus for their real-world use case.

Citus Data

April 12, 2023
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. This talk is for 1. SaaS companies looking to scale

    with distributed PostgreSQL. 100s 10s 1000s
  2. Customer Journey - 4 steps 1. What is Citus? 2.

    When to use Citus? 3. Data-modeling with Citus 4. Running app in production with Citus
  3. Superpower of ”distributed tables” to scale-out Postgres Start Building apps

    on Single node Seamlessly scale to multiple nodes by distributing tables
  4. Packaged as extension, not a fork Postgres At Any Scale

    Open source & Fully managed Open source & fully-managed
  5. Global replication Security Distributed Scale High Availability Compliance Enterprise-grade monitoring

    Easy ecosystem integration Extension support Extensions JSONB Full text search Geospatial support Rich indexing Powered by fully Open Source Citus Extension to Postgres Azure Cosmos DB for PostgreSQL Fully managed D-SQL offering built upon core benefits of PostgreSQL
  6. Multi-tenant SaaS Apps – Scenarios • Expecting to outgrow single

    node soon. • Complex homegrown sharding to manage tenants • Incurring large tech debt • Inflexible to scale out • Inflexible to rebalance tenants
  7. Scaling multi-tenant SaaS apps with Citus • Automatic sharding of

    tenants across nodes • Online scale-out with shard rebalancer • Tenant Isolation • Comprehensive tenant monitoring • Row Level Security per tenant
  8. 11K+ Customers 800ms response times for 95% of queries 100

    requests per second 20 nodes 10B+ Total #rows 2 clusters in EU and US Reference
  9. Timeseries / IoT apps - Scenarios • Apps providing real-time

    insights on telemetry data • Examples: device telemetry, vehicle telemetry, logs, event data etc. • Have implicit scale requirement • Affinity to PostgreSQL
  10. Scaling timeseries apps with Citus • Horizontal scale out -

    Start small and scale out to multiple nodes • JSONB to store semi-structured data • Native timeseries APIs for partitioning, expiry and archival.
  11. TIP 1: Column that adds natural dimension to data is

    a good shard key • Represents a central piece of application • Lives at heart of application • Examples: • SaaS apps – customer_id is a good candidate • IoT app – the device_id is a good candidate More details on Distribution column and how Citus distributes data and queries
  12. github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload

    Created_at org github_users User_id url login avatar Country_id gravatar Display login User_id is good candidate for distributing tables • Events generated by users themselves • Adds natural dimension & is at center of workload countries Country_id Name Continent Capital Example: App to analyze GitHub events
  13. Example: App to analyze GitHub events github_events Event_id Event_type Event_hash

    Repo id User_id Repo id Payload Created_at org github_users User_id url login avatar Country_id gravatar Display login countries Country_id Name Continent Capital Distributed PostgreSQL tables with co-location: Events & users distributed on user_id SELECT create_distributed_table( 'github_users', 'user_id'); SELECT create_distributed_table( 'github_events', 'user_id'); Reference table: SELECT create_reference_table( 'countries');
  14. Power of Co-locating tables Items-4 Items-4 Items-4 events_4 countries Items-4

    Items-4 Items-4 users_4 Items-4 Items-4 Items-4 events_8 Items-4 Items-4 Items-4 users_8 countries SQL pushdown - joins/fkeys Citus Worker 1 Citus Worker 2 Citus Coordinator countries events users
  15. github_events Event_id Event_type Event_hash Repo id User_id Repo id Payload

    (JSONB) Created_at org • Simplified schema – single table to store a variety of data • GIN indexing for fast JSONB access • NoSQL capabilities to Postgres world • Helpful in implementing workloads that need scale – Ex IoT, Events, Logs TIP 3: Use JSONB data-type to store unstructured data
  16. TIP 4: Timeseries partitioning on distributed tables github_events Event_id Event_type

    User_id Repo id User_id Repo id Payload (JSONB) Created_at org Distribute on user_id Partition on created_at github_events_jan Event_id Event_type User_id … github_events_feb Event_id Event_type User_id … github_events_sept Event_id Event_type User_id … github_events_aug Event_id Event_type User_id … github_events_july Event_id Event_type User_id … . . . . . . . Archiving older partitions using columnar Most recent partitions using row storage – real time ingests • Helps filtered querying with timestamp filters • Efficient expiry – DROP TABLE instead of DELETE • Helps archiving older data through columnar • Common in implementing timeseries workloads
  17. Timeseries APIs for partitioning, archival, & expiry • Automatic Data

    Partitioning – Easily Create partitions in past and future SELECT create_time_partitions( table_name:= 'time_series_events', partition_interval:= '1 day’, end_at:= '2021-10-30’, start_from:= '2021-10-10'); • Automatic Data Archival – Convert older partitions from row to columnar CALL alter_old_partitions_set_access_method('time_series_events’, '2021-10-20', 'columnar'); • Automatic Data Expiry - Expire data older than a given time period CALL drop_old_time_partitions( table_name:= 'time_series_events', older_than:= '2021-10-15');
  18. TIP 5: Scope queries to shard key where possible •

    Ex: SaaS workloads – queries can be scoped on tenant_id UPDATE ads SET impressions_count = impressions_count+1 WHERE id = 42; TO UPDATE ads SET impressions_count = impressions_count+1 WHERE id = 42 AND company_id = 1; • Avoids unnecessary fan out & helps query latency. • More relevant in transactional workloads
  19. 5 tips for distributed data-modeling with Postgres • TIP 1:

    Column that adds natural dimension to data is a good shard key • TIP 2: Co-locate large tables, make small tables reference • TIP 3: Use JSONB data-type to store unstructured data • TIP 4: Timeseries partitioning on distributed tables • TIP 5: Scope queries to shard key where possible
  20. Line item Open Source / Self managed Azure Cosmos DB

    for PostgreSQL App Migration Use 5-tips for data-modeling and modify app End-to-end migration guide Data Migration pg_dump/pg_restore, pgcopydb for online migration pg_azure_storage extension: Query and load data directly from blob. ADF, Azure Databricks, Azure Stream Analytics, any integration to PG. Deployment Docker images for dev/test Burstable Compute for dev/test ARM/BICEP templates High Availability Patroni HA for Citus Auto HA with SLAs Upgrades Open Source Docs Self-serve upgrades Disaster Recovery pg_dump/pg_restore pgbackrest, wal-e, wal-g Automatic Cross-AZ Backups + PITR Cross-region read replicas Infra Metrics Self-managed using any PG monitoring tool Azure Monitor Integration PG Bouncer Setup using community docs Pgbouncer support included DB tuning and troubleshooting Performance tuning 1, Performance tuning 2, Useful Diagnostic Queries Security Roles, RLS Roles, RLS, Azure Firewall, Azure Private Endpoint Scale Self-manage addition, removal of infra Use fully OSS shard rebalancer Out-of-box support for vertical and horizontal scale Checklist to run app in production with Citus