Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A Tale of Three Real-Time OLAP Databases (Neha Pawar & Chinmay Soman, StarTree) | RTA Summit 2023

A Tale of Three Real-Time OLAP Databases (Neha Pawar & Chinmay Soman, StarTree) | RTA Summit 2023

Open source OLAP databases are getting increasingly popular in the last few years and have seen mainstream adoption in several big companies such as LinkedIn, Uber, Netflix, Cisco, Bloomberg and so on. Choosing the right OLAP database depends on several factors such as use case, ingestion constraints and methodology, query performance, and production readiness to name a few. In this talk, we present a high-level comparison of 3 such open-source OLAP databases: Apache Pinot, Apache Druid, and Clickhouse with these dimensions in mind. In addition, we will also look at certain salient features in-depth to uncover how things work under the hood across different systems.

StarTree

May 23, 2023
Tweet

More Decks by StarTree

Other Decks in Technology

Transcript

  1. • Overview of real-time OLAP • High level architecture overview

    • Layers of a Real-time OLAP Database • Summary and deep-dive into each layer • Concluding remarks and takeaways Talk Structure
  2. Rise of Real-time OLAP Latency: seconds to minutes Freshness: Hours

    Concurrency: 100s of users Latency: Milliseconds Freshness: seconds Concurrency: Millions of users
  3. The three real-time OLAP databases Open Source Database Apache Pinot

    Apache Druid ClickHouse Version 0.12.0 Jan 19 2023 25.0.0 Jan 4 2023 23.1 Jan 26 2023
  4. The three real-time OLAP databases Open Source Database Apache Pinot

    Apache Druid ClickHouse Version 0.12.0 Jan 19 2023 25.0.0 Jan 4 2023 23.1 Jan 26 2023 Commercial DBaaS Product Offering StarTree Cloud Imply ClickHouse Cloud
  5. Why this matters Less components Easy getting started experience Separation

    of concerns Better scalability Inbuilt functionality for management and operability More operator friendly Better no-code experience Architecture Overview Summary
  6. Data Sources Apache Kafka Amazon Kinesis Google PubSub Apache Pulsar

    Change Data Capture (CDC) S3 GCS ADLS HDFS Databricks DeltaLake Apache Iceberg Apache Hudi Snowflake BigQuery MySQL Postgres StarTree only StarTree only StarTree only StarTree only
  7. Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored

    in batch sources Pre-Processing Filtering Transformation Rollups
  8. Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored

    in batch sources Pre-Processing Filtering Transformation Rollups
  9. Why this matters Pull Model Faster, efficient way of ingesting

    from real-time sources Push Model Required to get around connectivity issues or direct application writes Experimental Scalability Ingest at high message rate Limited Out of order Handle late arriving messages Exactly once guarantees Data, Query consistency Complex Setup Real-Time Ingestion
  10. Kafka Consumer Low-level • Accurate • Flexible • Scalable and

    fault tolerant High Level / Consumer group • Duplication • Scalability challenges • Not robust Ingestion Latency Very low: process tiny batches of data Moderate -> High: Large message batches recommended Scalability Easy to scale out Limited by Kafka partitions Data duplication (Ingestion) Cannot happen Can happen Replication complexity Easy: relying on Kafka offsets Difficult: Need Zookeeper/ClickHouse Keeper in the fast path Cloud Readiness Available in StarTree Cloud Not available in ClickHouse Cloud Real-Time Ingestion: Native Kafka support
  11. Why this matters Full Row upserts Data and Query accuracy

    Limited: async (sync is not recommended) Partial row upsert: Upsert certain columns based on primary key Data and Query accuracy De-duplication Data and Query accuracy Scalability Ability to handle high ingestion rate and lots of primary keys without sacrificing performance Real-Time Upserts
  12. Real-Time Upserts: Apache Pinot & StarTree Design Highlights • Low

    Query overhead: Reconciled at ingestion time • Query Performance: Per segment bitmaps help in filtering out obsolete records in an efficient manner • Scalability: Billions of primary keys per server and reduced heap usage Coming Soon ! • TTL support • Compaction • Native Bootstrap / Backfill support
  13. Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored

    in batch sources Pre-Processing Filtering Transformation Rollups
  14. Why this matters Pull Model Native support for ingesting data

    from S3, GCS, ADLS Push Model Ability to push large batches of data from Spark/Flink/… Data Format support Ability to handle wide variety of data format Exactly once / Ease of backfill Ability to replace historical data Batch Ingestion
  15. Why this matters Native bootstrap / One time ingestion ETL

    from SQL sources into real-time OLAP (one time) Native incremental ingestion Incremental ETL from SQL sources into real-time OLAP (hourly/daily/…) SQL Ingestion
  16. Batch/SQL support in Apache Pinot and StarTree Design Highlights •

    Scalable: Auto scaling for minions • Robust: Ingestion tasks are checkpointed and retried upon failures • Easy to use: No external dependecy
  17. Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored

    in batch sources Pre-Processing Filtering Transformation Rollups
  18. Pre-processing Why this matters Value/type transformation Filter, transform, flatten (nested)

    values before being ingested to speed up query performance Real-time Rollups Rollup real-time data to reduce data granularity for space and query optimization Batch Rollups Rollup batch data to reduce data granularity for space and query optimization SQL based transforms Use SQL to pre-process data Denormalization Combine data from multiple sources before ingestion
  19. Data Layout within Server Data Layout within Segment Partitioning Sorting

    Columnar Encoding and compression Data Layout within Cluster Storage tiers Storage Segment to server assignment
  20. Data Layout in Segment & Servers Why this matters Custom

    Partitioning Pruning at segment / server level Custom Sorting Better data locality, smarter encoding Can only be primary time column Smart data assignment on serving layer Reduce query fanout High level objective: Better p99 latency, increased throughput
  21. Why Data Layout helps Pinot Broker Query 1 2 5

    6 9 10 Server 1 Server 2 Server 3 3 4 7 8 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Broker level pruning 5 6 7 8 9 10 11 12 Server level pruning 1 2 3 4 5 6 7 8 9 10 11 12 Total segments to process Server 1 Server 2 Server 3
  22. Data Layout on Cluster - Tiered Storage 32 Server 1

    Server 2 SSD SSD Server 1 Server 2 HDD HDD Tiered Storage using compute node pools and local volumes Tiered, but still tightly coupled Server 1 Server 2 SSD SSD Cloud Object Storage Tiered Storage using Cloud Object Storage Fully decoupled
  23. Tiered Storage for Apache Pinot in StarTree Cloud Brokers Brokers

    Server 1 Server 2 Disk/SSD Disk/SSD Fully tightly-coupled Server 3 Server 4 Cloud Object Storage Fully decoupled Hybrid 33
  24. Tiered Storage using Cloud Object Storage Attributes of tiered storage

    Columnar / block fetch Prefetching Data altering Indexing Column level caching Index level caching
  25. Filter and aggregation optimizations Pinot Broker Query 1 2 5

    6 9 10 Server 1 Server 2 Server 3 3 4 7 8 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Broker level pruning 5 6 7 8 9 10 11 12 Server level pruning Filter optimizations Aggregation optimizations 1 2 3 4 5 6 7 8 9 10 11 12 Total segments to process Server 1 Server 2 Server 3
  26. Indexing • Minimize data scans • Keep query performance optimal

    and predictable ◦ High throughput ◦ Growing data size
  27. Index Purpose Inverted Fast filter Sorted Personalized analytics Range Range

    queries JSON Semi-structured data Geospatial Location queries Timestamp Timestamp rollups Text Log analytics Bloom Approximate filter Sparse Approximate filter
  28. 40 Star-tree Index name env_id type value ts cpu_core a357

    ... 20 ... mem 57f ... 70 ... instance 57f ... 12 ... cpu_core 57f ... 6 ... cpu_core a357 ... 9 ... … … ... ... ... Star-tree index Dimension split order = name, env_id Aggregations = sum(value) Max leaf records = 10 * cpu_core mem instance * name a357 57f * 57f a357 * < 10 records env_id 100 386 486 250 300 550 sum(value)
  29. 41 Star-tree Index select sum(value) from table where name =

    cpu_core and env_id = a357 Star-tree index Dimension split order = name, env_id Aggregations = sum(value) Max leaf records = 10 cpu_core mem instance * * a357 57f * 57f a357 * 386 486 250 300 550 100
  30. Materialized View vs Star-tree Index Star-tree index Materialized View Storage

    overhead Moderate: • Allows tuning space overhead High: • Materialize all combinations of selected dimensions Creation overhead Low: • Natively built like an index High: • Could need external data preprocessing workflow Backfill / sync overhead Low: • Can be created anytime • Easy to keep in sync High: • Could need external data preprocessing workflow • Challenges with keeping data in sync Query overhead Low: • Engine picks right index under the hood Moderate: • Queries need to target a specific view
  31. Query All aspects related to querying a database Query Caching

    Optimization to speedup repeated expensive queries Complex Query Constructs Constructs needed to run complex / multi-stage queries Basic Query Constructs Constructs needed to run basic OLAP queries
  32. Query: Basic Query Constructs Constructs needed to run basic OLAP

    queries Why this matters Filter Aggregation Group by Order by Bare minimum constructs needed for real-time OLAP Sketches Fast approximate counting on large distinct values
  33. Query: Complex Query Constructs Constructs needed to run complex /

    multi-stage queries Why this matters Pluggable UDFs Plug in custom UDFs Multi-stage query engine Execute complex queries without bottlenecks on single stage scatter gather Only for ingestion Window functions Aggregations over a window Joins Combine tables from multiple data sources during query
  34. Query: Query Caching Optimization to speedup repeated expensive queries Why

    this matters Broker cache Per query result cache Server cache Per segment result cache Eviction Evict stale entries as soon as data changes
  35. Query Performance How Zomato uses Apache Pinot Real-time Security Insights:

    Apache Pinot at Confluera Webex: Real-time Observability and Analytics with Apache Pinot Pinot in YouGov Chart comparing latency in ms
  36. Cluster Management Cluster operations Availability of easy mechanisms to perform

    common cluster operations Cluster setup Cluster setup options for isolation, high availability All aspects related to cluster operability
  37. Cluster Management: Cluster Setup Cluster setup options for isolation, high

    availability Why this matters Multi tenancy Host tables of multiple usecases on same cluster, with physical / logical isolations High availability Keep service available through partial component failures Separation of concerns using separate components, deep store
  38. Cluster Management: Cluster Operations Availability of easy mechanisms to support

    common cluster operations Why this matters Scale up / Scale out Make capacity changes to cluster and re-distribute data Node maintenance Replacing nodes / taking nodes down for maintenance Disaster recovery Permanent backup and recovery mechanisms to recover from cluster / node failures
  39. Purpose Authentication OIDC based authentication Kerberos based authentication Authorization ABAC

    RBAC Encryption Data encryption Storage Storage Storage + inbuilt Network encryption (SSL / TLS) Security
  40. Cloud Deployment Model Bring Your Own Cloud Cluster provisioned in

    customer’s cloud account Saas Cluster provisioned in vendor’s cloud account Ease of deploying as a fully-managed cloud solution, via commercial DBaaS providers
  41. Cloud Deployment Model Why this matters Saas Cluster provisioned in

    vendor’s cloud account - Customer ships data to this cluster Bring Your Own Cloud Cluster provisioned in customer’s cloud account - Customer retains full control of data Supported w/ limitations: Needs VPC Peering
  42. StarTree’s Bring Your Own Cloud • No VPC peering •

    Principle of least privilege • Data Security / Governance • SLA • Fully managed • Cost effective