$30 off During Our Annual Pro Sale. View Details »

Architecting operational real-time analytics apps with PostgreSQL & Hyperscale (Citus) | PGConf India 2020 | Parikshit Savjani

Architecting operational real-time analytics apps with PostgreSQL & Hyperscale (Citus) | PGConf India 2020 | Parikshit Savjani

Data is exploding with billions of device connected and emitting petabytes scale of data every minute. Postgres is the world’s most advanced open source relational database with NoSQL capabilities built in—and is growing in usage globally, including for business-critical, real-time streaming and operational analytics applications. With Hyperscale (Citus) now available as a built-in deployment option in Azure Database for PostgreSQL, you can architect Hybrid transactional analytical applications on Postgres at any scale on Azure. In this hands-on lab, you will build an operational real-time analytics & visualization app using PowerBI and Azure Database for PostgreSQL. Come and attend this workshop to architect a database for operational real-time analytics with Hyperscale (Citus)—from schema to data model to building real time dashboards in PowerBI.

Azure Database for PostgreSQL

February 26, 2020
Tweet

More Decks by Azure Database for PostgreSQL

Other Decks in Technology

Transcript

  1. Architecting operational real-
    time analytics apps with
    PostgreSQL & Hyperscale (Citus)
    Deepthi Anantharam
    Shashikant Shakya
    Parikshit Savjani
    Postgres Team at Microsoft
    @AzureDBPostgres @citusdata

    View Slide

  2. Session learning objective
    Architect a database for operational real-time analytics with
    Hyperscale (Citus)
    from schema <> data model <> rollups <> PowerBI

    View Slide

  3. How many of you are familiar with Citus extension
    on Postgres?

    View Slide

  4. How many of them have operational real-time
    analytics workloads?

    View Slide

  5. What we mean
    by operational
    real-time
    analytics
    End-users (lots of them) can run
    reports & complex analytics queries
    on-the-fly, on “fresh” (real-time) data

    View Slide

  6. Examples of operational real-time analytics
    Network telemetry Clickstream analysis IOT

    View Slide

  7. 5 requirements
    that real-time
    analytics
    applications all
    have
    • Large amounts of data
    • Sub-second response times
    • Large number of concurrent users
    • Ingests new data within seconds /
    minutes
    • Supports advanced analytics

    View Slide

  8. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  9. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  10. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  11. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  12. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  13. Why
    Postgres for
    real-time
    analytics?
    6 reasons
    • Both real-time ingest & complex analytics
    in same database
    • Data types, operators, functions
    • Pre-aggregations: INSERT … SELECT,
    unlogged tables, temp tables
    • Extensions: pg_partman, pg_cron, hll, topn
    • Connectivity with rest of data pipeline
    • Scale to very large data volumes (100s of
    TBs) with Hyperscale (Citus)

    View Slide

  14. 2 deployment options for Azure Database for
    PostgreSQL: Single server & Hyperscale (Citus)

    View Slide

  15. Scaling out your Postgres database horizontally
    Each node has PostgreSQL with Citus installed. Each shard [ ] is 1 PostgreSQL table.
    Gives your app more compute, memory, disk. Easy to add nodes
    for growth or performance.
    Coordinator
    node

    View Slide

  16. Distributed PostgreSQL
    scales better
    APPLICATION
    SELECT
    FROM
    GROUP BY
    company_id,
    avg(spend) AS avg_campaign_spend
    compaigns
    company_id;
    METADATA
    COORDINATOR NODE
    WORKER NODES
    W1
    W2
    W3 …
    Wn
    SELECT company_id
    sum(spend),
    count(spend) …
    FROM
    campaigns_2009 …
    SELECT company_id
    sum(spend),
    count(spend) …
    FROM
    campaigns_2001 …
    SELECT company_id
    sum(spend),
    count(spend) …
    FROM
    campaigns_2017 …
    DISTRIBUTED AGGREGATE

    View Slide

  17. Microsoft Windows team relies on Citus and
    Postgres (on Azure) for mission-critical shiproom
    decisions

    View Slide

  18. “We can support 100s
    of concurrent users &
    more than 6M queries
    every day. With Citus,
    response times for 75%
    of queries are less than
    200 ms. And response
    times for 95% of
    queries are less than 3
    seconds.”

    View Slide

  19. 100s of concurrent users
    6M queries every day
    75% of queries < 200 ms.
    95% of queries < 3 sec

    View Slide

  20. “Distributed PostgreSQL with Citus is a game
    changer”
    https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/architecting-
    petabyte-scale-analytics-by-scaling-out-postgres-on/ba-p/969685

    View Slide

  21. ~10x faster vs. Elasticsearch
    Powerful indexing:
    semi-structured data
    Focus on app by offloading
    management w/PaaS
    300 GB+ data
    Majority of queries < 2 sec

    View Slide

  22. PostgreSQL tips
    for real-time
    analytics
    applications
    • Azure Database for PostgreSQL as your fully-
    managed database
    • COPY to load large amounts of raw data
    • Indexes to find new events during queries & pre-
    aggregations
    • Rollup tables built from raw event data
    • Incremental aggregation if you can have late data
    • Partitioning with pg_partman to expire old data
    • Automate pipeline with pg_cron
    • HLL, TopN to incrementally approximate complex
    queries
    • Hyperscale (Citus) to scale out

    View Slide

  23. It’s time for the Technical, Hands-on Lab

    View Slide

  24. Scenario:
    Cloud services
    provider
    Helping
    businesses
    monitor HTTP
    traffic
    Every time client receives HTTP request,
    your service receives a log record
    Ingest records & create HTTP analytics
    dashboard to give your clients insights
    Fast queries for results in real-time
    Analyze multiple sites at once OR a
    analyze single site at a time

    View Slide

  25. Technical
    Lab—the
    steps involved
    UNDERSTAND THE
    SCENARIO
    SETUP HYPERSCALE
    (CITUS) SERVER GROUP
    & FIREWALL
    DESIGN DATA MODEL
    AND SHARDING
    CREATE ROLLUP TABLES
    EXPIRE OLDER DATA USE APPROXIMATIONS
    (HLL, TOPN) FOR FAST
    QUERIES
    CONNECT WITH
    POWERBI FOR
    POWERFUL
    VISUALIZATIONS

    View Slide

  26. Let’s get
    started!
    Architecting
    operational
    real-time
    analytics apps
    with
    PostgreSQL &
    Hyperscale
    (Citus)
    aka.ms/AA7dv4k
    Activation Code:
    ACTIVATE5349

    View Slide

  27. Session takeaways
    1. Use Azure Database for PostgreSQL with Hyperscale (Citus) to
    scale out the database horizontally—to deliver fast performance to real-
    time analytics applications
    2. Use COPY to load large amounts of raw data
    3. Use Indexes to find new events during queries & pre-aggregations
    4. Use Rollup tables built from raw event data
    5. Use Incremental aggregation if you can have late data
    6. Use Partitioning with pg_partman to expire old data
    7. Automate the data pipeline with pg_cron
    8. Use HLL, TopN to incrementally approximate complex queries

    View Slide

  28. Session resources
    ž Azure Postgres service page: aka.ms/azure-postgres
    ž Azure Postgres / TechCommunity blog: aka.ms/azure-postgres-blog
    ž Citus technical newsletter: aka.ms/citus-newsletter
    ž Citus open source repo: github.com/citusdata/citus
    ž Follow us @AzureDBPostgres: twitter.com/AzureDBPostgres
    ž Follow @citusdata: twitter.com/citusdata

    View Slide

  29. CITUSDATA.COM/NEWSLETTER
    Do you get the
    Citus Newsletter?

    View Slide

  30. A good newsletter is like a good GIN index.
    Sign up for the
    Citus Newsletter

    View Slide

  31. @talktosavjani • @citusdata • @AzureDBPostgres
    [email protected]
    ध"यवाद | Thank U

    View Slide