Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Architecting operational real-time analytics ap...

Architecting operational real-time analytics apps with PostgreSQL & Hyperscale (Citus) | PGConf India 2020 | Parikshit Savjani

Data is exploding with billions of device connected and emitting petabytes scale of data every minute. Postgres is the world’s most advanced open source relational database with NoSQL capabilities built in—and is growing in usage globally, including for business-critical, real-time streaming and operational analytics applications. With Hyperscale (Citus) now available as a built-in deployment option in Azure Database for PostgreSQL, you can architect Hybrid transactional analytical applications on Postgres at any scale on Azure. In this hands-on lab, you will build an operational real-time analytics & visualization app using PowerBI and Azure Database for PostgreSQL. Come and attend this workshop to architect a database for operational real-time analytics with Hyperscale (Citus)—from schema to data model to building real time dashboards in PowerBI.

Azure Database for PostgreSQL

February 26, 2020
Tweet

More Decks by Azure Database for PostgreSQL

Other Decks in Technology

Transcript

  1. Architecting operational real- time analytics apps with PostgreSQL & Hyperscale

    (Citus) Deepthi Anantharam Shashikant Shakya Parikshit Savjani Postgres Team at Microsoft @AzureDBPostgres @citusdata
  2. Session learning objective Architect a database for operational real-time analytics

    with Hyperscale (Citus) from schema <> data model <> rollups <> PowerBI
  3. What we mean by operational real-time analytics End-users (lots of

    them) can run reports & complex analytics queries on-the-fly, on “fresh” (real-time) data
  4. 5 requirements that real-time analytics applications all have • Large

    amounts of data • Sub-second response times • Large number of concurrent users • Ingests new data within seconds / minutes • Supports advanced analytics
  5. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  6. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  7. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  8. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  9. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  10. Why Postgres for real-time analytics? 6 reasons • Both real-time

    ingest & complex analytics in same database • Data types, operators, functions • Pre-aggregations: INSERT … SELECT, unlogged tables, temp tables • Extensions: pg_partman, pg_cron, hll, topn • Connectivity with rest of data pipeline • Scale to very large data volumes (100s of TBs) with Hyperscale (Citus)
  11. Scaling out your Postgres database horizontally Each node has PostgreSQL

    with Citus installed. Each shard [ ] is 1 PostgreSQL table. Gives your app more compute, memory, disk. Easy to add nodes for growth or performance. Coordinator node
  12. Distributed PostgreSQL scales better APPLICATION SELECT FROM GROUP BY company_id,

    avg(spend) AS avg_campaign_spend compaigns company_id; METADATA COORDINATOR NODE WORKER NODES W1 W2 W3 … Wn SELECT company_id sum(spend), count(spend) … FROM campaigns_2009 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2001 … SELECT company_id sum(spend), count(spend) … FROM campaigns_2017 … DISTRIBUTED AGGREGATE
  13. Microsoft Windows team relies on Citus and Postgres (on Azure)

    for mission-critical shiproom decisions
  14. “We can support 100s of concurrent users & more than

    6M queries every day. With Citus, response times for 75% of queries are less than 200 ms. And response times for 95% of queries are less than 3 seconds.”
  15. 100s of concurrent users 6M queries every day 75% of

    queries < 200 ms. 95% of queries < 3 sec
  16. ~10x faster vs. Elasticsearch Powerful indexing: semi-structured data Focus on

    app by offloading management w/PaaS 300 GB+ data Majority of queries < 2 sec
  17. PostgreSQL tips for real-time analytics applications • Azure Database for

    PostgreSQL as your fully- managed database • COPY to load large amounts of raw data • Indexes to find new events during queries & pre- aggregations • Rollup tables built from raw event data • Incremental aggregation if you can have late data • Partitioning with pg_partman to expire old data • Automate pipeline with pg_cron • HLL, TopN to incrementally approximate complex queries • Hyperscale (Citus) to scale out
  18. Scenario: Cloud services provider Helping businesses monitor HTTP traffic Every

    time client receives HTTP request, your service receives a log record Ingest records & create HTTP analytics dashboard to give your clients insights Fast queries for results in real-time Analyze multiple sites at once OR a analyze single site at a time
  19. Technical Lab—the steps involved UNDERSTAND THE SCENARIO SETUP HYPERSCALE (CITUS)

    SERVER GROUP & FIREWALL DESIGN DATA MODEL AND SHARDING CREATE ROLLUP TABLES EXPIRE OLDER DATA USE APPROXIMATIONS (HLL, TOPN) FOR FAST QUERIES CONNECT WITH POWERBI FOR POWERFUL VISUALIZATIONS
  20. Let’s get started! Architecting operational real-time analytics apps with PostgreSQL

    & Hyperscale (Citus) aka.ms/AA7dv4k Activation Code: ACTIVATE5349
  21. Session takeaways 1. Use Azure Database for PostgreSQL with Hyperscale

    (Citus) to scale out the database horizontally—to deliver fast performance to real- time analytics applications 2. Use COPY to load large amounts of raw data 3. Use Indexes to find new events during queries & pre-aggregations 4. Use Rollup tables built from raw event data 5. Use Incremental aggregation if you can have late data 6. Use Partitioning with pg_partman to expire old data 7. Automate the data pipeline with pg_cron 8. Use HLL, TopN to incrementally approximate complex queries
  22. Session resources ž Azure Postgres service page: aka.ms/azure-postgres ž Azure

    Postgres / TechCommunity blog: aka.ms/azure-postgres-blog ž Citus technical newsletter: aka.ms/citus-newsletter ž Citus open source repo: github.com/citusdata/citus ž Follow us @AzureDBPostgres: twitter.com/AzureDBPostgres ž Follow @citusdata: twitter.com/citusdata