Is PostgreSQL becoming THE real-time analytics database? | PGCon 2018 | Ozgun Erdogan

Is PostgreSQL becoming THE real-time analytics database? | PGCon 2018 | Ozgun Erdogan

024d6a0dd14fb31c804969a57a06dfbe?s=128

Citus Data

May 31, 2018
Tweet

Transcript

  1. Is PostgreSQL becoming THE real-time analytics database? Ozgun Erdogan Citus

    Data PGCon | May 2018
  2. What is real-time analytics? •  Real-time analytics is an emerging

    workload for databases. This workload is also known as in-memory databases or Hybrid Transactional Analytical Processing (HTAP). •  The real-time analytics workload includes use-cases such as powering customer facing dashboards, behavior analytics (segmentation and funnels), IoT / device analytics, time-series databases, and geospatial analysis.
  3. New databases for real-time analytics •  In past years, proprietary

    databases have been built to handle real-time analytics workloads. An example is SAP HANA. •  Postgres’ new features and extensions make it appealing and competitive for real-time analytics – while also providing the benefits of open source. •  This lightning talk showcases five exciting developments for real-time analytics workloads.
  4. (1) JIT compilation for faster queries •  Andres has been

    working on JIT compilation using LLVM since Postgres 9.6. These changes improve analytical query performance for in-memory datasets. Diagram from Andres’ FOSDEM presentation: http://anarazel.de/talks/ fosdem-2018-02-03/jit.pdf
  5. (2) Approximation algos for fast replies •  When your database

    powers customer facing dashboards, your analytical queries need to go over hundreds of millions of rows in less than a second. •  The HyperLogLog (HLL) extension provides an approximation algorithm to count the number of distinct items with tunable precision. •  The TopN extension allows you to store, merge, and serve approximations for order by-group by-limit queries.
  6. (3) Citus – Scale out Postgres S1 Citus Node #1

    PostgreSQL Primary WAL logs Monitoring Agents (Auto node failover) 10.0.0.15 (Elastic IP) Application (Ruby, Pyhton, Java, …) EBS volume S5 S7 S8 Shards (S1, S5, S7, S8) Transparent to application S2 Citus Node #8 PostgreSQL Primary WAL logs 10.0.0.25 (Elastic IP) EBS volume S3 S4 S9 ……. PgBou ncer PgBou ncer
  7. (4) Native partitioning & partition by time •  Postgres 10

    introduced native partitioning. •  Real-time analytics data has a time dimension to it. This makes it suitable to partition by time – so that your indexes remain local to your partitions and data expiration becomes simple. •  Common pattern: Shard by granular key, such as user_id or device_id. Then partition by time.
  8. (5) You can use these features together •  A common

    pattern in real-time analytics is hot updateable data and cold compressed data. •  You update data for the most recent partition for that day. You then “close out” the partition and rotate this data into a cold partition. •  With large data sets, you want to compress your cold partitions. CStore foreign data wrapper can sort and compress cold data.
  9. (5) Shard on granular key, partition by time, and cstore_fdw

    to compress cold data Image from Microsoft.com (SQL Server 2016)
  10. Conclusion: Postgres is awesome •  Proprietary databases built in features

    tailored to real- time analytics workloads. •  Three years ago at PGCon: •  https://www.citusdata.com/blog/2015/08/19/futuristic- pgshard-demo/ (Dynamically changing row and columnar store in Postgres) •  The future is today!