Slide 1

Slide 1 text

Architecting operational real- time analytics apps with PostgreSQL & Hyperscale (Citus) Deepthi Anantharam Shashikant Shakya Parikshit Savjani Postgres Team at Microsoft @AzureDBPostgres @citusdata

Slide 2

Slide 2 text

Session learning objective Architect a database for operational real-time analytics with Hyperscale (Citus) from schema <> data model <> rollups <> PowerBI

Slide 3

Slide 3 text

How many of you are familiar with Citus extension on Postgres?

Slide 4

Slide 4 text

How many of them have operational real-time analytics workloads?

Slide 5

Slide 5 text

What we mean by operational real-time analytics End-users (lots of them) can run reports & complex analytics queries on-the-fly, on “fresh” (real-time) data

Slide 6

Slide 6 text

Examples of operational real-time analytics Network telemetry Clickstream analysis IOT

Slide 7

Slide 7 text

5 requirements that real-time analytics applications all have • Large amounts of data • Sub-second response times • Large number of concurrent users • Ingests new data within seconds / minutes • Supports advanced analytics

Slide 8

Slide 8 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 9

Slide 9 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 10

Slide 10 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 11

Slide 11 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 12

Slide 12 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 13

Slide 13 text

Why Postgres for real-time analytics? 6 reasons • Both real-time ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)

Slide 14

Slide 14 text

2 deployment options for Azure Database for PostgreSQL: Single server & Hyperscale (Citus)

Slide 15

Slide 15 text

Scaling out your Postgres database horizontally Each node has PostgreSQL with Citus installed. Each shard [ ] is 1 PostgreSQL table. Gives your app more compute, memory, disk. Easy to add nodes for growth or performance. Coordinator node

Slide 16

Slide 16 text

Distributed PostgreSQL scales better APPLICATION SELECT FROM GROUP BY company_id, avg(spend) AS avg_campaign_spend compaigns company_id; METADATA COORDINATOR NODE WORKER NODES W1 W2 W3 … Wn SELECT company_id sum(spend), count(spend) … FROM campaigns_2009 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2001 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2017 … DISTRIBUTED AGGREGATE

Slide 17

Slide 17 text

Microsoft Windows team relies on Citus and Postgres (on Azure) for mission-critical shiproom decisions

Slide 18

Slide 18 text

“We can support 100s of concurrent users & more than 6M queries every day. With Citus, response times for 75% of queries are less than 200 ms. And response times for 95% of queries are less than 3 seconds.”

Slide 19

Slide 19 text

100s of concurrent users 6M queries every day 75% of queries < 200 ms. 95% of queries < 3 sec

Slide 20

Slide 20 text

“Distributed PostgreSQL with Citus is a game changer” https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/architecting- petabyte-scale-analytics-by-scaling-out-postgres-on/ba-p/969685

Slide 21

Slide 21 text

~10x faster vs. Elasticsearch Powerful indexing: semi-structured data Focus on app by offloading management w/PaaS 300 GB+ data Majority of queries < 2 sec

Slide 22

Slide 22 text

PostgreSQL tips for real-time analytics applications • Azure Database for PostgreSQL as your fully- managed database • COPY to load large amounts of raw data • Indexes to find new events during queries & pre- aggregations • Rollup tables built from raw event data • Incremental aggregation if you can have late data • Partitioning with pg_partman to expire old data • Automate pipeline with pg_cron • HLL, TopN to incrementally approximate complex queries • Hyperscale (Citus) to scale out

Slide 23

Slide 23 text

It’s time for the Technical, Hands-on Lab

Slide 24

Slide 24 text

Scenario: Cloud services provider Helping businesses monitor HTTP traffic Every time client receives HTTP request, your service receives a log record Ingest records & create HTTP analytics dashboard to give your clients insights Fast queries for results in real-time Analyze multiple sites at once OR a analyze single site at a time

Slide 25

Slide 25 text

Technical Lab—the steps involved UNDERSTAND THE SCENARIO SETUP HYPERSCALE (CITUS) SERVER GROUP & FIREWALL DESIGN DATA MODEL AND SHARDING CREATE ROLLUP TABLES EXPIRE OLDER DATA USE APPROXIMATIONS (HLL, TOPN) FOR FAST QUERIES CONNECT WITH POWERBI FOR POWERFUL VISUALIZATIONS

Slide 26

Slide 26 text

Let’s get started! Architecting operational real-time analytics apps with PostgreSQL & Hyperscale (Citus) aka.ms/AA7dv4k Activation Code: ACTIVATE5349

Slide 27

Slide 27 text

Session takeaways 1. Use Azure Database for PostgreSQL with Hyperscale (Citus) to scale out the database horizontally—to deliver fast performance to real- time analytics applications 2. Use COPY to load large amounts of raw data 3. Use Indexes to find new events during queries & pre-aggregations 4. Use Rollup tables built from raw event data 5. Use Incremental aggregation if you can have late data 6. Use Partitioning with pg_partman to expire old data 7. Automate the data pipeline with pg_cron 8. Use HLL, TopN to incrementally approximate complex queries

Slide 28

Slide 28 text

Session resources ž Azure Postgres service page: aka.ms/azure-postgres ž Azure Postgres / TechCommunity blog: aka.ms/azure-postgres-blog ž Citus technical newsletter: aka.ms/citus-newsletter ž Citus open source repo: github.com/citusdata/citus ž Follow us @AzureDBPostgres: twitter.com/AzureDBPostgres ž Follow @citusdata: twitter.com/citusdata

Slide 29

Slide 29 text

CITUSDATA.COM/NEWSLETTER Do you get the Citus Newsletter?

Slide 30

Slide 30 text

A good newsletter is like a good GIN index. Sign up for the Citus Newsletter

Slide 31

Slide 31 text

@talktosavjani • @citusdata • @AzureDBPostgres [email protected] ध"यवाद | Thank U